Unicode character classes and conversions

Some of the procedures that operate on characters or strings ignore the difference between upper case and lower case. These procedures have -ci (for “case insensitive”) embedded in their names.

Characters

Procedure: char-upcase char

Procedure: char-downcase char

Procedure: char-titlecase char

Procedure: char-foldcase char

These procedures take a character argument and return a character result.

If the argument is an upper–case or title–case character, and if there is a single character that is its lower–case form, then char-downcase returns that character.

If the argument is a lower–case or title–case character, and there is a single character that is its upper–case form, then char-upcase returns that character.

If the argument is a lower–case or upper–case character, and there is a single character that is its title–case form, then char-titlecase returns that character.

If the argument is not a title–case character and there is no single character that is its title–case form, then char-titlecase returns the upper–case form of the argument.

Finally, if the character has a case–folded character, then char-foldcase returns that character. Otherwise the character returned is the same as the argument.

For Turkic characters #\x130 and #\x131, char-foldcase behaves as the identity function; otherwise char-foldcase is the same as char-downcase composed with char-upcase.

(char-upcase #\i)               ⇒  #\I
(char-downcase #\i)             ⇒  #\i
(char-titlecase #\i)            ⇒  #\I
(char-foldcase #\i)             ⇒  #\i

(char-upcase #\ß)               ⇒  #\ß
(char-downcase #\ß)             ⇒  #\ß
(char-titlecase #\ß)            ⇒  #\ß
(char-foldcase #\ß)             ⇒  #\ß

(char-upcase #\Σ)               ⇒  #\Σ
(char-downcase #\Σ)             ⇒  #\σ
(char-titlecase #\Σ)            ⇒  #\Σ
(char-foldcase #\Σ)             ⇒  #\σ

(char-upcase #\ς)               ⇒  #\Σ
(char-downcase #\ς)             ⇒  #\ς
(char-titlecase #\ς)            ⇒  #\Σ
(char-foldcase #\ς)             ⇒  #\σ

Note: char-titlecase does not always return a title–case character.

Note: These procedures are consistent with Unicode’s locale–independent mappings from scalar values to scalar values for upcase, downcase, titlecase, and case–folding operations. These mappings can be extracted from UnicodeData.txt and CaseFolding.txt from the Unicode Consortium, ignoring Turkic mappings in the latter.

Note that these character–based procedures are an incomplete approximation to case conversion, even ignoring the user’s locale. In general, case mappings require the context of a string, both in arguments and in result. The string-upcase, string-downcase, string-titlecase, and string-foldcase procedures perform more general case conversion.

Procedure: char-ci=? char1 char2 char3

Procedure: char-ci<? char1 char2 char3

Procedure: char-ci>? char1 char2 char3

Procedure: char-ci<=? char1 char2 char3

Procedure: char-ci>=? char1 char2 char3

These procedures are similar to char=?, etc., but operate on the case–folded versions of the characters.

(char-ci<? #\z #\Z)             ⇒ #f
(char-ci=? #\z #\Z)             ⇒ #f
(char-ci=? #\ς #\σ)             ⇒ #t

Procedure: char-alphabetic? char

Procedure: char-numeric? char

Procedure: char-whitespace? char

Procedure: char-upper-case? char

Procedure: char-lower-case? char

Procedure: char-title-case? char

These procedures return #t if their arguments are alphabetic, numeric, whitespace, upper–case, lower–case, or title–case characters, respectively; otherwise they return #f.

A character is alphabetic if it has the Unicode “Alphabetic” property. A character is numeric if it has the Unicode “Numeric” property. A character is whitespace if has the Unicode “White_Space” property. A character is upper case if it has the Unicode “Uppercase” property, lower case if it has the “Lowercase” property, and title case if it is in the Lt general category.

(char-alphabetic? #\a)          ⇒  #t
(char-numeric? #\1)             ⇒  #t
(char-whitespace? #\space)      ⇒  #t
(char-whitespace? #\x00A0)      ⇒  #t
(char-upper-case? #\Σ)          ⇒  #t
(char-lower-case? #\σ)          ⇒  #t
(char-lower-case? #\x00AA)      ⇒  #t
(char-title-case? #\I)          ⇒  #f
(char-title-case? #\x01C5)      ⇒  #t

Procedure: char-general-category char

Return a symbol representing the Unicode general category of char, one of Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd, Nl, No, Ps, Pe, Pi, Pf, Pd, Pc, Po, Sc, Sm, Sk, So, Zs, Zp, Zl, Cc, Cf, Cs, Co, or Cn.

(char-general-category #\a)         ⇒ Ll
(char-general-category #\space)     ⇒ Zs
(char-general-category #\x10FFFF)   ⇒ Cn  

Strings

Procedure: string-upcase string

Procedure: string-downcase string

Procedure: string-titlecase string

Procedure: string-foldcase string

These procedures take a string argument and return a string result. They are defined in terms of Unicode’s locale–independent case mappings from Unicode scalar–value sequences to scalar–value sequences. In particular, the length of the result string can be different from the length of the input string. When the specified result is equal in the sense of string=? to the argument, these procedures may return the argument instead of a newly allocated string.

The string-upcase procedure converts a string to upper case; string-downcase converts a string to lower case. The string-foldcase procedure converts the string to its case–folded counterpart, using the full case–folding mapping, but without the special mappings for Turkic languages. The string-titlecase procedure converts the first cased character of each word, and downcases all other cased characters.

(string-upcase "Hi")              ⇒ "HI"
(string-downcase "Hi")            ⇒ "hi"
(string-foldcase "Hi")            ⇒ "hi"

(string-upcase "Straße")          ⇒ "STRASSE"
(string-downcase "Straße")        ⇒ "straße"
(string-foldcase "Straße")        ⇒ "strasse"
(string-downcase "STRASSE")       ⇒ "strasse"

(string-downcase "Σ")             ⇒ "σ"
; Chi Alpha Omicron Sigma:
(string-upcase "ΧΑΟΣ")            ⇒ "ΧΑΟΣ"
(string-downcase "ΧΑΟΣ")          ⇒ "χαος"
(string-downcase "ΧΑΟΣΣ")         ⇒ "χαοσς"
(string-downcase "ΧΑΟΣ Σ")        ⇒ "χαος σ"
(string-foldcase "ΧΑΟΣΣ")         ⇒ "χαοσσ"
(string-upcase "χαος")            ⇒ "ΧΑΟΣ"
(string-upcase "χαοσ")            ⇒ "ΧΑΟΣ"

(string-titlecase "kNock KNoCK")  ⇒ "Knock Knock"
(string-titlecase "who's there?") ⇒ "Who's There?"
(string-titlecase "r6rs")         ⇒ "R6rs"
(string-titlecase "R6RS")         ⇒ "R6rs"

Note: The case mappings needed for implementing these procedures can be extracted from UnicodeData.txt, SpecialCasing.txt, WordBreakProperty.txt (the “MidLetter” property partly defines case–ignorable characters), and CaseFolding.txt from the Unicode Consortium.

Since these procedures are locale–independent, they may not be appropriate for some locales.

Note: Word breaking, as needed for the correct casing of the upper case greek sigma and for string-titlecase, is specified in Unicode Standard Annex #29.

Kawa Note: The implementation of string-titlecase does not correctly handle the case where an initial character needs to be converted to multiple characters, such as “LATIN SMALL LIGATURE FL” which should be converted to the two letters "Fl".

Procedure: string-ci=? string1 string2 string3

Procedure: string-ci<? string1 string2 string3

Procedure: string-ci>? string1 string2 string3

Procedure: string-ci<=? string1 string2 string3

Procedure: string-ci>=? string1 string2 string3

These procedures are similar to string=?, etc., but operate on the case–folded versions of the strings.

(string-ci<? "z" "Z")                   ⇒ #f
(string-ci=? "z" "Z")                   ⇒ #t
(string-ci=? "Straße" "Strasse")        ⇒ #t
(string-ci=? "Straße" "STRASSE")        ⇒ #t
(string-ci=? "ΧΑΟΣ" "χαοσ")             ⇒ #t

Procedure: string-normalize-nfd string

Procedure: string-normalize-nfkd string

Procedure: string-normalize-nfc string

Procedure: string-normalize-nfkc string

These procedures take a string argument and return a string result, which is the input string normalized to Unicode normalization form D, KD, C, or KC, respectively. When the specified result is equal in the sense of string=? to the argument, these procedures may return the argument instead of a newly allocated string.

(string-normalize-nfd "\xE9;")          ⇒ "\x65;\x301;"
(string-normalize-nfc "\xE9;")          ⇒ "\xE9;"
(string-normalize-nfd "\x65;\x301;")    ⇒ "\x65;\x301;"
(string-normalize-nfc "\x65;\x301;")    ⇒ "\xE9;"

Procedure: string-for-each proc string1 string2

The strings must all have the same length. proc should accept as many arguments as there are strings.

The string-for-each procedure applies proc element–wise to the characters of the strings for its side effects, in order from the first characters to the last. proc is always called in the same dynamic environment as string-for-each itself.

Analogous to for-each.

(let ((v '()))
  (string-for-each
    (lambda (c) (set! v (cons (char->integer c) v)))
    "abcde")
   v)
  ⇒ (101 100 99 98 97)

Deprecated in-place case modification

The following functions are deprecated; they really don’t and cannot do the right thing, because in some languages upper and lower case can use different number of characters.

Procedure: string-upcase! str

Deprecated: Destructively modify str, replacing the letters by their upper-case equivalents.

Procedure: string-downcase! str

Deprecated: Destructively modify str, replacing the letters by their upper-lower equivalents.

Procedure: string-capitalize! str

Deprecated: Destructively modify str, such that the letters that start a new word are replaced by their title-case equivalents, while non-initial letters are replaced by their lower-case equivalents.