Strings

Strings are sequences of characters. The length of a string is the number of characters that it contains, as an exact non-negative integer. This number is fixed when the string is created. The valid indices of a string are the exact non-negative integers less than the length of the string. The first character of a string has index 0, the second has index 1, and so on.
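As a small illustration of length and zero-based indexing, the following sketch uses only string-length and string-ref, which are described later in this chapter. The variable name greeting is purely illustrative.

```scheme
(define greeting "hello")

(string-length greeting)   ; ⇒ 5
(string-ref greeting 0)    ; ⇒ #\h  (the first character has index 0)
(string-ref greeting 4)    ; ⇒ #\o  (the last valid index is length - 1)
```

An index of 5 would be out of range here, since valid indices are strictly less than the length.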

Some of the procedures that operate on strings ignore the difference between upper and lower case. The names of the versions that ignore case end with “-ci” (for “case insensitive”).

Kawa note: Kawa’s implementation of strings that contain surrogate characters does not quite follow the R6RS specification. Specifically, indexing into such a string retrieves a surrogate rather than a Unicode scalar value. It is not clear what the best solution is: there is a tradeoff between performance, compatibility with R6RS, and interoperability with Java APIs.

String literals

string ::= "string-element*"
string-element ::= any character other than " or \
    | mnemonic-escape | \" | \\
    | \intraline-whitespace*line-ending intraline-whitespace*
    | inline-hex-escape
mnemonic-escape ::= \a | \b | \t | \n | \r | ... (see below)

A string is written as a sequence of characters enclosed within quotation marks ("). Within a string literal, various escape sequences represent characters other than themselves. Escape sequences always start with a backslash (\):

\a

Alarm (bell), #\x0007.

\b

Backspace, #\x0008.

\e

Escape, #\x001B.

\f

Form feed, #\x000C.

\n

Linefeed (newline), #\x000A.

\r

Return, #\x000D.

\t

Character tabulation, #\x0009.

\v

Vertical tab, #\x000B.

\C-x
\^x

The character whose scalar value is that of x masked (anded) with #x9F. This is an alternative way to write the ASCII control characters. For example "\C-m" or "\^m" is the same as "\x000D;" (which is the same as "\r"). As a special case \^? is rubout (delete) (\x7f;).

\x hex-scalar-value;
\X hex-scalar-value;

A hex encoding that gives the scalar value of a character.

\ oct-digit+

At most three octal digits that give the scalar value of a character. (Historical, for C compatibility.)

\u hex-digit+

Exactly four hex digits that give the scalar value of a character. (Historical, for Java compatibility.)

\M-x

(Historical, for Emacs Lisp.) Set the meta-bit (high-bit of single byte) of the following character x.

\|

Vertical line, #\x007c. (Not useful for string literals, but useful for symbols.)

\"

Double quote, #\x0022.

\\

Backslash, #\x005C.

\intraline-whitespace*line-ending intraline-whitespace*

Nothing (ignored). Allows you to split up a long string over multiple lines; ignoring initial whitespace on the continuation lines allows you to indent them.

Except for a line ending, any character outside of an escape sequence stands for itself in the string literal. A line ending which is preceded by \intraline-whitespace* expands to nothing (along with any trailing intraline-whitespace), and can be used to indent strings for improved legibility. Any other line ending has the same effect as inserting a \n character into the string.
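The rules for escapes and line endings can be checked with a few comparisons. This is a minimal sketch that assumes only the behaviour described above; the variable names are illustrative.

```scheme
;; A hex escape names a character by its scalar value (#x41 is "A").
(define hex-same? (string=? "\x41;BC" "ABC"))     ; ⇒ #t

;; An escape sequence stands for a single character.
(define tab-length (string-length "a\tb"))         ; ⇒ 3

;; Backslash before the line ending: the line ending and the
;; indentation of the continuation line expand to nothing.
(define continued "one \
                   line")                          ; ⇒ "one line"

;; Any other line ending inside the literal acts like \n.
(define two-lines "first
second")                                           ; ⇒ "first\nsecond"
```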

Examples:

"The word \"recursion\" has many meanings."
"Another example:\ntwo lines of text"
"Here’s text \
containing just one line"
"\x03B1; is named GREEK SMALL LETTER ALPHA."

String templates

The following syntax is a string template (also called a string quasi-literal or “here document”):

&{Hello &[name]!}

Assuming the variable name evaluates to "John" then the example evaluates to "Hello John!".

The Kawa reader converts the above example to:

($string$ "Hello " $<<$ name $>>$ "!")

See the SRFI-109 specification for details.

extended-string-literal ::= &{ [initial-ignored] string-literal-part* }
string-literal-part ::=  any character except &, { or }
    | { string-literal-part* }
    | char-ref
    | entity-ref
    | special-escape
    | enclosed-part

You can use the plain "string" syntax for longer multiline strings, but &{string} has various advantages. The syntax is less error-prone because the start-delimiter is different from the end-delimiter. Also note that nested braces are allowed: a right brace } is only an end-delimiter if it is unbalanced, so you would seldom need to escape it:

&{This has a {braced} section.}
  ⇒ "This has a {braced} section."

The escape character used for special characters is &. This is compatible with XML syntax and XML literals.

Special characters

char-ref ::=
    &# digit+ ;
  | &#x hex-digit+  ;
entity-ref ::=
    & char-or-entity-name ;
char-or-entity-name ::= tagname

You can use the standard XML syntax for character references, using either decimal or hexadecimal values. The following string has two instances of the ASCII escape character, as either decimal 27 or hex 1B:

&{&#27;&#x1B;} ⇒ "\e\e"

You can also use the pre-defined XML entity names:

&{&amp; &lt; &gt; &quot; &apos;} ⇒ "& < > \" '"

In addition, &lbrace; and &rbrace; can be used for left and right curly braces, though you don’t need them for balanced braces:

&{ &rbrace;_&lbrace; / {_} }  ⇒ " }_{ / {_} "

You can use the standard XML entity names. For example:

&{L&aelig;rdals&oslash;yri}
  ⇒ "Lærdalsøyri"

You can also use the standard R7RS character names null, alarm, backspace, tab, newline, return, escape, space, and delete. For example:

&{&escape;&space;}

The syntax &name; is actually syntactic sugar (specifically reader syntax) for the variable reference $entity$:name. Hence you can also define your own entity names:

(define $entity$:crnl "\r\n")
&{&crnl;} ⇒ "\r\n"

Multiline string literals

initial-ignored ::=
    intraline-whitespace* line-ending intraline-whitespace* &|
special-escape ::=
    intraline-whitespace* &|
  | & nested-comment
  | &- intraline-whitespace* line-ending

A line-ending directly in the text becomes a newline, as in a simple string literal:

(string-capitalize &{one two three
uno dos tres
}) ⇒ "One Two Three\nUno Dos Tres\n"

However, you have extra control over layout. If the string is in a nested expression, it is confusing (and ugly) if the string cannot be indented to match the surrounding context. The indentation marker &| is used to mark the end of insignificant initial whitespace. The &| characters and all the preceding whitespace are removed. In addition, it also suppresses an initial newline. Specifically, when the initial left-brace is followed by optional (invisible) intraline-whitespace, then a newline, then optional intraline-whitespace (the indentation), and finally the indentation marker &|, all of these are removed from the output. Otherwise the &| only removes initial intraline-whitespace on the same line (and itself).

(write (string-capitalize &{
     &|one two three
     &|uno dos tres
}) out)
    ⇒ prints "One Two Three\nUno Dos Tres\n"

As a matter of style, all of the indentation lines should line up. It is an error if there are any non-whitespace characters between the previous newline and the indentation marker. It is also an error to write an indentation marker before the first newline in the literal.

The line-continuation marker &- is used to suppress a newline:

&{abc&-
  def} ⇒ "abc  def"

You can write a #|...|#-style comment following a &. This could be useful for annotation, or line numbers:

&{&#|line 1|#one two
  &#|line 2|# three
  &#|line 3|#uno dos tres
} ⇒ "one two\n three\nuno dos tres\n"

Embedded expressions

enclosed-part ::=
    & enclosed-modifier [ expression* ]
  | & enclosed-modifier ( expression+ )

An embedded expression has the form &[expression]. It is evaluated, the result is converted to a string (as by display), and that string is added to the result string. (If there are multiple expressions, they are all evaluated and the corresponding strings inserted in the result.)

&{Hello &[(string-capitalize name)]!}

You can leave out the square brackets when the expression is a parenthesized expression:

&{Hello &(string-capitalize name)!}

Formatting

enclosed-modifier ::=
  ~ format-specifier-after-tilde*

Using format allows finer-grained control over the output, but a problem is that the association between format specifiers and data expressions is positional, which is hard to read and error-prone. A better solution places the specifier adjacent to the data expression:

&{The response was &~,2f(* 100.0 (/ responses total))%.}

The following escape forms are equivalent to the corresponding forms without the ~fmt-spec, except that the expression(s) are formatted using format:

&~fmt-spec[expression*] 

Again using parentheses like this:

&~fmt-spec(expression+)

is equivalent to:

&~fmt-spec[(expression+)]

The syntax of format specifications is arcane, but it allows you to do some pretty neat things in a compact space. For example to include "_" between each element of the array arr you can use the ~{...~} format specifiers:

(define arr [5 6 7])
&{&~{&[arr]&~^_&~}} ⇒ "5_6_7"

If no format is specified for an enclosed expression, then that is equivalent to a ~a format specifier, so this is equivalent to:

&{&~{&~a[arr]&~^_&~}} ⇒ "5_6_7"

which is in turn equivalent to:

(format #f "~{~a~^_~}" arr)

The fine print that makes this work: If there are multiple expressions in a &[...] with no format specifier then there is an implicit ~a for each expression. On the other hand, if there is an explicit format specifier, it is not repeated for each enclosed expression: it appears exactly once in the effective format string, whether there are zero, one, or many expressions.

Basic string procedures

Procedure: string? obj

Return #t if obj is a string, #f otherwise.

Type: string

The type of string objects.

Constructor: string char …

Return a newly allocated string composed of the arguments. This is analogous to list. The underlying type is the interface java.lang.CharSequence. Immutable strings are java.lang.String, while mutable strings are gnu.lists.FString.

Procedure: make-string k

Procedure: make-string k char

Return a newly allocated string of length k. If char is given, then all elements of the string are initialized to char, otherwise the contents of the string are unspecified.
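A brief sketch of the two constructors, string and make-string, assuming only the behaviour described above. The variable names are illustrative.

```scheme
;; Build a string from individual characters, analogous to list.
(define abc (string #\a #\b #\c))       ; ⇒ "abc"

;; Build a string of a given length, filled with one character.
(define stars (make-string 3 #\*))      ; ⇒ "***"

;; A length of zero yields the empty string.
(define nothing (make-string 0 #\x))    ; ⇒ ""
```

When the fill character is omitted, the contents are unspecified, so it is safest to pass it whenever the initial contents matter.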

Procedure: string-length string

Return the number of characters in the given string as an exact integer object.

Procedure: string-ref string k

k must be a valid index of string. The string-ref procedure returns character k of string using zero–origin indexing.

Procedure: string-set! string k char

This procedure stores char in element k of string.

(define s1 (make-string 3 #\*))
(define s2 "***")
(string-set! s1 0 #\?) ⇒ void
s1 ⇒ "?**"
(string-set! s2 0 #\?) ⇒ error
(string-set! (symbol->string 'immutable) 0 #\?) ⇒ error

Procedure: substring string start end

string must be a string, and start and end must be exact integer objects satisfying:

0 <= start <= end <= (string-length string)

The substring procedure returns a newly allocated string formed from the characters of string beginning with index start (inclusive) and ending with index end (exclusive).
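For example, with start inclusive and end exclusive (the variable name text is illustrative):

```scheme
(define text "hello world")

(substring text 0 5)     ; ⇒ "hello"
(substring text 6 11)    ; ⇒ "world"
(substring text 3 3)     ; ⇒ ""  (start = end gives the empty string)
```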

Procedure: string-append string …

Return a newly allocated string whose characters form the concatenation of the given strings.
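A short sketch of concatenation with several arguments and with none. The variable names are illustrative.

```scheme
(define full-name (string-append "Ada" " " "Lovelace"))   ; ⇒ "Ada Lovelace"

;; With no arguments the result is the empty string.
(define no-parts (string-append))                          ; ⇒ ""
```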

Procedure: string->list string [start [end]]

Procedure: list->string list

It is an error if any element of list is not a character.

The string->list procedure returns a newly allocated list of the characters of string between start and end. The list->string procedure returns a newly allocated string formed from the characters in list. In both procedures, order is preserved. The string->list and list->string procedures are inverses so far as equal? is concerned.
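The following sketch shows both conversions, including the optional start and end arguments of string->list. The helper string-reverse is a hypothetical name introduced only for this example.

```scheme
(string->list "apple")         ; ⇒ (#\a #\p #\p #\l #\e)
(string->list "apple" 2)       ; ⇒ (#\p #\l #\e)
(string->list "apple" 1 3)     ; ⇒ (#\p #\p)
(list->string (list #\a #\b))  ; ⇒ "ab"

;; Round-tripping through a list makes list operations available.
(define (string-reverse s)
  (list->string (reverse (string->list s))))

(string-reverse "abc")         ; ⇒ "cba"
```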

Procedure: string-for-each proc string1 string2 …

The strings must all have the same length. proc should accept as many arguments as there are strings.

The string-for-each procedure applies proc element–wise to the characters of the strings for its side effects, in order from the first characters to the last. proc is always called in the same dynamic environment as string-for-each itself.

Analogous to for-each.

(let ((v '()))
  (string-for-each
    (lambda (c) (set! v (cons (char->integer c) v)))
    "abcde")
   v)
  ⇒ (101 100 99 98 97)
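With more than one string, proc receives one character from each string per call. A minimal sketch, assuming two strings of equal length; the variable name pairs is illustrative.

```scheme
(define pairs '())

(string-for-each
  (lambda (a b)
    ;; Combine the characters at the same index into a two-character string.
    (set! pairs (cons (string a b) pairs)))
  "abc" "xyz")

(reverse pairs)   ; ⇒ ("ax" "by" "cz")
```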

Procedure: string-copy string [start [end]]

Returns a newly allocated copy of the part of the given string between start and end.
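Because the copy is newly allocated, modifying it leaves the source string untouched. A small sketch, with illustrative variable names:

```scheme
(define original "hello world")
(define tail (string-copy original 6))   ; ⇒ "world"

(string-set! tail 0 #\W)
;; tail     ⇒ "World"
;; original ⇒ "hello world"
```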

Procedure: string-copy! to at from [start [end]]

Copies the characters of the string from that are between start and end into the string to, starting at index at. The order in which characters are copied is unspecified, except that if the source and destination overlap, copying takes place as if the source is first copied into a temporary string and then into the destination. (This is achieved without allocating storage by making sure to copy in the correct direction in such circumstances.)

(define a "12345")
(define b (string-copy "abcde"))
(string-copy! b 1 a 0 2)
b  ⇒  "a12de"

Procedure: string-fill! string fill [start [end]]

The string-fill! procedure stores fill in the elements of string between start and end. It is an error if fill is not a character or is forbidden in strings.
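A short sketch of filling a sub-range and then the whole string. The variable names are illustrative, and the string is created with make-string so that it is mutable.

```scheme
(define bar (make-string 5 #\-))

;; Fill indices 1 (inclusive) to 3 (exclusive).
(string-fill! bar #\x 1 3)
(define partly-filled (string-copy bar))   ; ⇒ "-xx--"

;; Without start and end, every element is replaced.
(string-fill! bar #\o)
;; bar ⇒ "ooooo"
```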

String Comparisons

Procedure: string=? string1 string2 string3 …

Return #t if the strings are the same length and contain the same characters in the same positions. Otherwise, the string=? procedure returns #f.

(string=? "Straße" "Strasse")    ⇒ #f

Procedure: string<? string1 string2 string3 …

Procedure: string>? string1 string2 string3 …

Procedure: string<=? string1 string2 string3 …

Procedure: string>=? string1 string2 string3 …

These procedures return #t if their arguments are (respectively): monotonically increasing, monotonically decreasing, monotonically non-decreasing, or monotonically non-increasing. These predicates are required to be transitive.

These procedures are the lexicographic extensions to strings of the corresponding orderings on characters. For example, string<? is the lexicographic ordering on strings induced by the ordering char<? on characters. If two strings differ in length but are the same up to the length of the shorter string, the shorter string is considered to be lexicographically less than the longer string.

(string<? "z" "ß")      ⇒ #t
(string<? "z" "zz")     ⇒ #t
(string<? "z" "Z")      ⇒ #f

Procedure: string-ci=? string1 string2 string3 …

Procedure: string-ci<? string1 string2 string3 …

Procedure: string-ci>? string1 string2 string3 …

Procedure: string-ci<=? string1 string2 string3 …

Procedure: string-ci>=? string1 string2 string3 …

These procedures are similar to string=?, etc., but behave as if they applied string-foldcase to their arguments before invoking the corresponding procedures without -ci.

(string-ci<? "z" "Z")                   ⇒ #f
(string-ci=? "z" "Z")                   ⇒ #t
(string-ci=? "Straße" "Strasse")        ⇒ #t
(string-ci=? "Straße" "STRASSE")        ⇒ #t
(string-ci=? "ΧΑΟΣ" "χαοσ")             ⇒ #t

String Conversions

Procedure: string-upcase string

Procedure: string-downcase string

Procedure: string-titlecase string

Procedure: string-foldcase string

These procedures take a string argument and return a string result. They are defined in terms of Unicode’s locale–independent case mappings from Unicode scalar–value sequences to scalar–value sequences. In particular, the length of the result string can be different from the length of the input string. When the specified result is equal in the sense of string=? to the argument, these procedures may return the argument instead of a newly allocated string.

The string-upcase procedure converts a string to upper case; string-downcase converts a string to lower case. The string-foldcase procedure converts the string to its case–folded counterpart, using the full case–folding mapping, but without the special mappings for Turkic languages. The string-titlecase procedure converts the first cased character of each word to title case, and downcases all other cased characters.

(string-upcase "Hi")              ⇒ "HI"
(string-downcase "Hi")            ⇒ "hi"
(string-foldcase "Hi")            ⇒ "hi"

(string-upcase "Straße")          ⇒ "STRASSE"
(string-downcase "Straße")        ⇒ "straße"
(string-foldcase "Straße")        ⇒ "strasse"
(string-downcase "STRASSE")       ⇒ "strasse"

(string-downcase "Σ")             ⇒ "σ"
; Chi Alpha Omicron Sigma:
(string-upcase "ΧΑΟΣ")            ⇒ "ΧΑΟΣ"
(string-downcase "ΧΑΟΣ")          ⇒ "χαος"
(string-downcase "ΧΑΟΣΣ")         ⇒ "χαοσς"
(string-downcase "ΧΑΟΣ Σ")        ⇒ "χαος σ"
(string-foldcase "ΧΑΟΣΣ")         ⇒ "χαοσσ"
(string-upcase "χαος")            ⇒ "ΧΑΟΣ"
(string-upcase "χαοσ")            ⇒ "ΧΑΟΣ"

(string-titlecase "kNock KNoCK")  ⇒ "Knock Knock"
(string-titlecase "who's there?") ⇒ "Who's There?"
(string-titlecase "r6rs")         ⇒ "R6rs"
(string-titlecase "R6RS")         ⇒ "R6rs"

Note: The case mappings needed for implementing these procedures can be extracted from UnicodeData.txt, SpecialCasing.txt, WordBreakProperty.txt (the “MidLetter” property partly defines case–ignorable characters), and CaseFolding.txt from the Unicode Consortium.

Since these procedures are locale–independent, they may not be appropriate for some locales.

Note: Word breaking, as needed for the correct casing of the upper case Greek sigma and for string-titlecase, is specified in Unicode Standard Annex #29.

Kawa Note: The implementation of string-titlecase does not correctly handle the case where an initial character needs to be converted to multiple characters, such as “LATIN SMALL LIGATURE FL” which should be converted to the two letters "Fl".

Procedure: string-normalize-nfd string

Procedure: string-normalize-nfkd string

Procedure: string-normalize-nfc string

Procedure: string-normalize-nfkc string

These procedures take a string argument and return a string result, which is the input string normalized to Unicode normalization form D, KD, C, or KC, respectively. When the specified result is equal in the sense of string=? to the argument, these procedures may return the argument instead of a newly allocated string.

(string-normalize-nfd "\xE9;")          ⇒ "\x65;\x301;"
(string-normalize-nfc "\xE9;")          ⇒ "\xE9;"
(string-normalize-nfd "\x65;\x301;")    ⇒ "\x65;\x301;"
(string-normalize-nfc "\x65;\x301;")    ⇒ "\xE9;"