Kawa: Strings

Strings

Strings are sequences of characters. The length of a string is the number of characters that it contains, as an exact non-negative integer. The valid indices of a string are the exact non-negative integers less than the length of the string. The first character of a string has index 0, the second has index 1, and so on.

Strings are implemented as a sequence of 16-bit char values, even though they’re semantically a sequence of 32-bit Unicode code points. A character whose value is greater than #xffff is represented using two surrogate characters. The implementation allows for natural interoperability with Java APIs. However, it does make certain operations (indexing or counting based on character counts) difficult to implement efficiently. Luckily, one rarely needs to index or count based on character counts; alternatives are discussed below.

There are different kinds of strings:

An istring is immutable: It is fixed, and cannot be modified. On the other hand, indexing (e.g. string-ref) is efficient (constant-time), while indexing of other string implementations takes time proportional to the index.

String literals are istrings, as are the return values of most of the procedures in this chapter.

An istring is an instance of the gnu.lists.IString class.

An mstring is mutable: You can replace individual characters (using string-set!). You can also change the mstring’s length by inserting or removing characters (using string-append! or string-replace!).

An mstring is an instance of the gnu.lists.FString class.

Any other object that implements the java.lang.CharSequence interface is also a string. This includes standard Java java.lang.String and java.lang.StringBuilder objects.

Some of the procedures that operate on strings ignore the difference between upper and lower case. The names of the versions that ignore case end with “-ci” (for “case insensitive”).

Compatibility: Many of the following procedures (for example string-append) return an immutable istring in Kawa, but return a “freshly allocated” mutable string in standard Scheme (including R7RS) as well as most Scheme implementations (including previous versions of Kawa). To get the “compatibility mode” versions of those procedures (which return mstrings), invoke Kawa with one of the --r5rs, --r6rs, or --r7rs options, or import a standard library like (scheme base).

Type: string

The type of string objects. The underlying type is the interface java.lang.CharSequence. Immutable strings are gnu.lists.IString or java.lang.String, while mutable strings are gnu.lists.FString.

Basic string procedures

Procedure: string? obj

Return #t if obj is a string, #f otherwise.

Procedure: istring? obj

Return #t if obj is an istring (an immutable, constant-time-indexable string); #f otherwise.

Constructor: string char …

Return a string composed of the arguments. This is analogous to list.

Compatibility: The result is an istring, except in compatibility mode, when it is a newly allocated mstring.

Procedure: string-length string

Return the number of characters in the given string as an exact integer object.

Performance note: If the string is not an istring, calling string-length may take time proportional to the length of the string, because of the need to scan for surrogate pairs.

Procedure: string-ref string k

k must be a valid index of string. The string-ref procedure returns character k of string using zero–origin indexing.

Performance note: If the string is not an istring, then calling string-ref may take time proportional to k because of the need to check for surrogate pairs. An alternative is to use string-cursor-ref. If iterating through a string, use string-for-each.

Procedure: string-null? string

Is string the empty string? Same result as (= (string-length string) 0) but executes in O(1) time.

Procedure: string-every pred string [start end]

Procedure: string-any pred string [start end]

Checks to see if every/any character in string satisfies pred, proceeding from left (index start) to right (index end). These procedures are short-circuiting: if pred returns false, string-every does not call pred on subsequent characters; if pred returns true, string-any does not call pred on subsequent characters. Both procedures are “witness-generating”:

If string-every is given an empty interval (with start = end), it returns #t.

If string-every returns true for a non-empty interval (with start < end), the returned true value is the one returned by the final call to the predicate on (string-ref string (- end 1)).

If string-any returns true, the returned true value is the one returned by the predicate.

Note: The names of these procedures do not end with a question mark. This indicates a general value is returned instead of a simple boolean (#t or #f).
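
The witness-generating behavior is easiest to see with a predicate that returns something other than #t. In the sketch below, digit-or-false is a hypothetical helper introduced only for illustration; the results shown are those implied by the rules above.

```scheme
;; Hypothetical helper: returns the character itself when it is a digit,
;; so the "witness" is visible in the result.
(define (digit-or-false c)
  (and (char-numeric? c) c))

(string-any digit-or-false "ab3d9")    ; ⇒ #\3 (first true value from pred)
(string-every digit-or-false "2024")   ; ⇒ #\4 (value of the final call)
(string-every digit-or-false "20x4")   ; ⇒ #f
(string-every digit-or-false "")       ; ⇒ #t (empty interval)
(string-any digit-or-false "abc")      ; ⇒ #f
```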

Immutable String Constructors

Procedure: string-tabulate proc len

Constructs a string of size len by calling proc on each value from 0 (inclusive) to len (exclusive) to produce the corresponding element of the string. The procedure proc accepts an exact integer as its argument and returns a character. The order in which proc is called on those indexes is not specified.

Rationale: Although string-unfold is more general, string-tabulate is likely to run faster for the common special case it implements.
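
As a minimal illustration (the lambda expressions are made up for this example), string-tabulate can build a string directly from index values:

```scheme
;; Build the first five lower-case letters from their indexes.
(string-tabulate
  (lambda (i) (integer->char (+ i (char->integer #\a))))
  5)                                   ; ⇒ "abcde"

;; A length of 0 produces the empty string; proc is never called.
(string-tabulate (lambda (i) #\-) 0)   ; ⇒ ""
```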

Procedure: string-unfold stop? mapper successor seed [base make-final]

Procedure: string-unfold-right stop? mapper successor seed [base make-final]

This is a fundamental and powerful constructor for strings.

successor is used to generate a series of “seed” values from the initial seed: seed, (successor seed), (successor² seed), (successor³ seed), ...

stop? tells us when to stop: generation ends when stop? returns true for one of these seed values.

mapper maps each seed value to the corresponding character(s) in the result string, which are assembled into that string in left-to-right order. It is an error for mapper to return anything other than a character or string.

base is the optional initial/leftmost portion of the constructed string, which defaults to the empty string "". It is an error if base is anything other than a character or string.

make-final is applied to the terminal seed value (on which stop? returns true) to produce the final/rightmost portion of the constructed string. It defaults to (lambda (x) ""). It is an error for make-final to return anything other than a character or string.

string-unfold-right is the same as string-unfold except the results of mapper are assembled into the string in right-to-left order, base is the optional rightmost portion of the constructed string, and make-final produces the leftmost portion of the constructed string.

You can use string-unfold to convert a list to a string, read a port into a string, reverse a string, copy a string, and so forth. Examples:
(define (port->string p)
  (string-unfold eof-object? values
                 (lambda (x) (read-char p))
                 (read-char p)))

(define (list->string lis)
  (string-unfold null? car cdr lis))

(define (string-tabulate f size)
  (string-unfold (lambda (i) (= i size)) f (lambda (i) (+ i 1)) 0))
To map f over a list lis, producing a string:
(string-unfold null? (compose f car) cdr lis)
Interested functional programmers may enjoy noting that string-fold-right and string-unfold are in some sense inverses. That is, given operations knull?, kar, kdr, kons, and knil satisfying
(kons (kar x) (kdr x)) = x  and  (knull? knil) = #t
then
(string-fold-right kons knil (string-unfold knull? kar kdr x)) = x
and
(string-unfold knull? kar kdr (string-fold-right kons knil string)) = string.
This combinator pattern is sometimes called an “anamorphism.”

Selection

Procedure: substring string start end

string must be a string, and start and end must be exact integer objects satisfying:
0 <= start <= end <= (string-length string)
The substring procedure returns a newly allocated string formed from the characters of string beginning with index start (inclusive) and ending with index end (exclusive).
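
A few calls with illustrative inputs, showing that start is inclusive and end is exclusive:

```scheme
(substring "hello world" 0 5)    ; ⇒ "hello"
(substring "hello world" 6 11)   ; ⇒ "world"
(substring "hello world" 3 3)    ; ⇒ "" (start = end is allowed)
```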

Procedure: string-take string nchars

Procedure: string-drop string nchars

Procedure: string-take-right string nchars

Procedure: string-drop-right string nchars

string-take returns an immutable string containing the first nchars of string; string-drop returns a string containing all but the first nchars of string. string-take-right returns a string containing the last nchars of string; string-drop-right returns a string containing all but the last nchars of string.
(string-take "Pete Szilagyi" 6) ⇒ "Pete S"
(string-drop "Pete Szilagyi" 6) ⇒ "zilagyi"

(string-take-right "Beta rules" 5) ⇒ "rules"
(string-drop-right "Beta rules" 5) ⇒ "Beta "
It is an error to take or drop more characters than are in the string:
(string-take "foo" 37) ⇒ error

Procedure: string-pad string len [char start end]

Procedure: string-pad-right string len [char start end]

Returns an istring of length len composed of the characters drawn from the given subrange of string, padded on the left (right) by as many occurrences of the character char as needed. If string has more than len chars, it is truncated on the left (right) to length len. The char defaults to #\space.
(string-pad     "325" 5) ⇒ "  325"
(string-pad   "71325" 5) ⇒ "71325"
(string-pad "8871325" 5) ⇒ "71325"
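
The right-hand variant and a non-default padding character behave symmetrically. The inputs below are chosen for illustration, and the results follow from the description above:

```scheme
(string-pad-right "325" 5)       ; ⇒ "325  "
(string-pad-right "8871325" 5)   ; ⇒ "88713" (truncated on the right)
(string-pad "42" 5 #\0)          ; ⇒ "00042"
```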

Procedure: string-trim string [pred start end]

Procedure: string-trim-right string [pred start end]

Procedure: string-trim-both string [pred start end]

Returns an istring obtained from the given subrange of string by skipping over all characters on the left / on the right / on both sides that satisfy the second argument pred; pred defaults to char-whitespace?.
(string-trim-both "  The outlook wasn't brilliant,  \n\r")
    ⇒ "The outlook wasn't brilliant,"
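
The one-sided variants and a custom predicate can be sketched as follows (the zero-stripping lambda is an illustrative assumption, not part of Kawa):

```scheme
(string-trim "  abc  ")         ; ⇒ "abc  "
(string-trim-right "  abc  ")   ; ⇒ "  abc"

;; A custom predicate: strip leading zeros only.
(string-trim "00710" (lambda (c) (char=? c #\0)))   ; ⇒ "710"
```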

String Comparisons

Procedure: string=? string₁ string₂ string₃ …

Return #t if the strings are the same length and contain the same characters in the same positions. Otherwise, the string=? procedure returns #f.
(string=? "Straße" "Strasse")    ⇒ #f

Procedure: string<? string₁ string₂ string₃ …

Procedure: string>? string₁ string₂ string₃ …

Procedure: string<=? string₁ string₂ string₃ …

Procedure: string>=? string₁ string₂ string₃ …

These procedures return #t if their arguments are (respectively): monotonically increasing, monotonically decreasing, monotonically non-decreasing, or monotonically non-increasing. These predicates are required to be transitive.

These procedures are the lexicographic extensions to strings of the corresponding orderings on characters. For example, string<? is the lexicographic ordering on strings induced by the ordering char<? on characters. If two strings differ in length but are the same up to the length of the shorter string, the shorter string is considered to be lexicographically less than the longer string.
(string<? "z" "ß")      ⇒ #t
(string<? "z" "zz")     ⇒ #t
(string<? "z" "Z")      ⇒ #f
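
With more than two arguments, the comparison must hold between each adjacent pair. A short sketch with made-up inputs:

```scheme
(string<? "apple" "banana" "cherry")   ; ⇒ #t
(string<? "apple" "cherry" "banana")   ; ⇒ #f
(string<=? "a" "a" "b")                ; ⇒ #t
(string<? "a" "a" "b")                 ; ⇒ #f (not strictly increasing)
```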

Procedure: string-ci=? string₁ string₂ string₃ …

Procedure: string-ci<? string₁ string₂ string₃ …

Procedure: string-ci>? string₁ string₂ string₃ …

Procedure: string-ci<=? string₁ string₂ string₃ …

Procedure: string-ci>=? string₁ string₂ string₃ …

These procedures are similar to string=?, etc., but behave as if they applied string-foldcase to their arguments before invoking the corresponding procedures without -ci.
(string-ci<? "z" "Z")                   ⇒ #f
(string-ci=? "z" "Z")                   ⇒ #t
(string-ci=? "Straße" "Strasse")        ⇒ #t
(string-ci=? "Straße" "STRASSE")        ⇒ #t
(string-ci=? "ΧΑΟΣ" "χαοσ")             ⇒ #t

Conversions

Procedure: list->string list

The list->string procedure returns an istring formed from the characters in list, in order. It is an error if any element of list is not a character.

Compatibility: The result is an istring, except in compatibility mode, when it is an mstring.

Procedure: reverse-list->string list

An efficient implementation of (compose list->string reverse):
(reverse-list->string '(#\a #\B #\c))  ⇒ "cBa"
This is a common idiom in the epilogue of string-processing loops that accumulate their result using a list in reverse order. (See also string-concatenate-reverse for the “chunked” variant.)

Procedure: string->list string [start [end]]

The string->list procedure returns a newly allocated list of the characters of string between start and end, in order. The string->list and list->string procedures are inverses so far as equal? is concerned.
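
For instance, with an illustrative input:

```scheme
(string->list "hello")                  ; ⇒ (#\h #\e #\l #\l #\o)
(string->list "hello" 1 3)              ; ⇒ (#\e #\l)
(list->string (string->list "hello"))   ; ⇒ "hello"
```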

Procedure: vector->string vector [start [end]]

The vector->string procedure returns a newly allocated string of the objects contained in the elements of vector between start and end. It is an error if any element of vector between start and end is not a character, or is a character forbidden in strings.
(vector->string #(#\1 #\2 #\3))             ⇒ "123"
(vector->string #(#\1 #\2 #\3 #\4 #\5) 2 4) ⇒ "34"

Procedure: string->vector string [start [end]]

The string->vector procedure returns a newly created vector initialized to the elements of the string string between start and end.
(string->vector "ABC")       ⇒ #(#\A #\B #\C)
(string->vector "ABCDE" 1 3) ⇒ #(#\B #\C)

Procedure: string-upcase string

Procedure: string-downcase string

Procedure: string-titlecase string

Procedure: string-foldcase string

These procedures take a string argument and return a string result. They are defined in terms of Unicode’s locale–independent case mappings from Unicode scalar–value sequences to scalar–value sequences. In particular, the length of the result string can be different from the length of the input string. When the specified result is equal in the sense of string=? to the argument, these procedures may return the argument instead of a newly allocated string.

The string-upcase procedure converts a string to upper case; string-downcase converts a string to lower case. The string-foldcase procedure converts the string to its case–folded counterpart, using the full case–folding mapping, but without the special mappings for Turkic languages. The string-titlecase procedure converts the first cased character of each word to title case, and downcases all other cased characters.
(string-upcase "Hi")              ⇒ "HI"
(string-downcase "Hi")            ⇒ "hi"
(string-foldcase "Hi")            ⇒ "hi"

(string-upcase "Straße")          ⇒ "STRASSE"
(string-downcase "Straße")        ⇒ "straße"
(string-foldcase "Straße")        ⇒ "strasse"
(string-downcase "STRASSE")       ⇒ "strasse"

(string-downcase "Σ")             ⇒ "σ"
; Chi Alpha Omicron Sigma:
(string-upcase "ΧΑΟΣ")            ⇒ "ΧΑΟΣ"
(string-downcase "ΧΑΟΣ")          ⇒ "χαος"
(string-downcase "ΧΑΟΣΣ")         ⇒ "χαοσς"
(string-downcase "ΧΑΟΣ Σ")        ⇒ "χαος σ"
(string-foldcase "ΧΑΟΣΣ")         ⇒ "χαοσσ"
(string-upcase "χαος")            ⇒ "ΧΑΟΣ"
(string-upcase "χαοσ")            ⇒ "ΧΑΟΣ"

(string-titlecase "kNock KNoCK")  ⇒ "Knock Knock"
(string-titlecase "who's there?") ⇒ "Who's There?"
(string-titlecase "r6rs")         ⇒ "R6rs"
(string-titlecase "R6RS")         ⇒ "R6rs"
Since these procedures are locale–independent, they may not be appropriate for some locales.

Kawa Note: The implementation of string-titlecase does not correctly handle the case where an initial character needs to be converted to multiple characters, such as “LATIN SMALL LIGATURE FL” which should be converted to the two letters "Fl".

Compatibility: The result is an istring, except in compatibility mode, when it is an mstring.

Procedure: string-normalize-nfd string

Procedure: string-normalize-nfkd string

Procedure: string-normalize-nfc string

Procedure: string-normalize-nfkc string

These procedures take a string argument and return a string result, which is the input string normalized to Unicode normalization form D, KD, C, or KC, respectively. When the specified result is equal in the sense of string=? to the argument, these procedures may return the argument instead of a newly allocated string.
(string-normalize-nfd "\xE9;")          ⇒ "\x65;\x301;"
(string-normalize-nfc "\xE9;")          ⇒ "\xE9;"
(string-normalize-nfd "\x65;\x301;")    ⇒ "\x65;\x301;"
(string-normalize-nfc "\x65;\x301;")    ⇒ "\xE9;"

Searching and matching

Procedure: string-prefix-length string₁ string₂ [start₁ end₁ start₂ end₂]

Procedure: string-suffix-length string₁ string₂ [start₁ end₁ start₂ end₂]

Return the length of the longest common prefix/suffix of string₁ and string₂. For prefixes, this is equivalent to their “mismatch index” (relative to the start indexes).

The optional start/end indexes restrict the comparison to the indicated substrings of string₁ and string₂.

Procedure: string-prefix? string₁ string₂ [start₁ end₁ start₂ end₂]

Procedure: string-suffix? string₁ string₂ [start₁ end₁ start₂ end₂]

Is string₁ a prefix/suffix of string₂?

The optional start/end indexes restrict the comparison to the indicated substrings of string₁ and string₂.
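
A brief sketch of the prefix and suffix procedures, using made-up inputs; the results follow from the definitions above:

```scheme
(string-prefix-length "kawa" "kaput")        ; ⇒ 2 (common prefix "ka")
(string-suffix-length "testing" "running")   ; ⇒ 3 (common suffix "ing")
(string-prefix? "ka" "kawa")                 ; ⇒ #t
(string-suffix? "wa" "kawa")                 ; ⇒ #t
(string-prefix? "kawa" "ka")                 ; ⇒ #f
```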

Procedure: string-index string pred [start end]

Procedure: string-index-right string pred [start end]

Procedure: string-skip string pred [start end]

Procedure: string-skip-right string pred [start end]

string-index searches through the given substring from the left, returning the index of the leftmost character satisfying the predicate pred. string-index-right searches from the right, returning the index of the rightmost character satisfying the predicate pred. If no match is found, these procedures return #f.

The start and end arguments specify the beginning and end of the search; the valid indexes relevant to the search include start but exclude end. Beware of “fencepost” errors: when searching right-to-left, the first index considered is (- end 1), whereas when searching left-to-right, the first index considered is start. That is, the start/end indexes describe the same half-open interval [start,end) in these procedures that they do in other string procedures.

The -skip functions are similar, but use the complement of the criterion: they search for the first char that doesn’t satisfy pred. To skip over initial whitespace, for example, say
(substring string
            (or (string-skip string char-whitespace?)
                (string-length string))
            (string-length string))
These functions can be trivially composed with string-take and string-drop to produce take-while, drop-while, span, and break procedures without loss of efficiency.
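
The following sketch shows the four search procedures and one such composition. The name take-while is hypothetical (it is not a Kawa procedure), and the inputs are illustrative:

```scheme
(string-index "hello world" char-whitespace?)       ; ⇒ 5
(string-index-right "hello world"
                    (lambda (c) (char=? c #\o)))     ; ⇒ 7
(string-skip "   abc" char-whitespace?)              ; ⇒ 3
(string-index "abc" char-numeric?)                   ; ⇒ #f

;; Hypothetical take-while built from string-skip and string-take:
;; keep the longest prefix whose characters all satisfy pred.
(define (take-while str pred)
  (string-take str (or (string-skip str pred)
                       (string-length str))))

(take-while "123abc" char-numeric?)                  ; ⇒ "123"
```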

Procedure: string-contains string₁ string₂ [start₁ end₁ start₂ end₂]

Procedure: string-contains-right string₁ string₂ [start₁ end₁ start₂ end₂]

Does the substring of string₁ specified by start₁ and end₁ contain the sequence of characters given by the substring of string₂ specified by start₂ and end₂?

Returns #f if there is no match. If start₂ = end₂, string-contains returns start₁ but string-contains-right returns end₁. Otherwise returns the index in string₁ for the first character of the first/last match; that index lies within the half-open interval [start₁,end₁), and the match lies entirely within the [start₁,end₁) range of string₁.
(string-contains "eek -- what a geek." "ee" 12 18) ; Searches "a geek"
    ⇒ 15
Note: The names of these procedures do not end with a question mark. This indicates a useful value is returned when there is a match.
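
Some further calls, with inputs chosen for illustration, contrast the leftmost and rightmost match and show the empty-pattern case described above:

```scheme
(string-contains "banana" "an")         ; ⇒ 1 (first match)
(string-contains-right "banana" "an")   ; ⇒ 3 (last match)
(string-contains "banana" "xyz")        ; ⇒ #f
(string-contains "banana" "")           ; ⇒ 0 (empty pattern: start₁)
(string-contains-right "banana" "")     ; ⇒ 6 (empty pattern: end₁)
```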

Concatenation and replacing

Procedure: string-append string …

Returns a string whose characters form the concatenation of the given strings.

Compatibility: The result is an istring, except in compatibility mode, when it is an mstring.

Procedure: string-concatenate string-list

Concatenates the elements of string-list together into a single istring.

Rationale: Some implementations of Scheme limit the number of arguments that may be passed to an n-ary procedure, so the (apply string-append string-list) idiom, which is otherwise equivalent to using this procedure, is not as portable.

Procedure: string-concatenate-reverse string-list [final-string [end]]

With no optional arguments, calling this procedure is equivalent to (string-concatenate (reverse string-list)). If the optional argument final-string is specified, it is effectively consed onto the beginning of string-list before performing the list-reverse and string-concatenate operations.

If the optional argument end is given, only the characters up to but not including end in final-string are added to the result, thus producing
(string-concatenate 
  (reverse (cons (substring final-string 0 end)
                 string-list)))
For example:
(string-concatenate-reverse '(" must be" "Hello, I") " going.XXXX" 7)
  ⇒ "Hello, I must be going."
Rationale: This procedure is useful when constructing procedures that accumulate character data into lists of string buffers, and wish to convert the accumulated data into a single string when done. The optional end argument accommodates that use case when final-string is a partially filled mutable string, and is allowed (for uniformity) when final-string is an immutable string.

Procedure: string-join string-list [delimiter [grammar]]

This procedure is a simple unparser; it pastes strings together using the delimiter string, returning an istring.

The string-list is a list of strings. The delimiter is the string used to delimit elements; it defaults to a single space " ".

The grammar argument is a symbol that determines how the delimiter is used, and defaults to 'infix. It is an error for grammar to be any symbol other than these four:

'infix

An infix or separator grammar: insert the delimiter between list elements. An empty list will produce an empty string.

'strict-infix

Means the same as 'infix if the string-list is non-empty, but will signal an error if given an empty list. (This avoids an ambiguity shown in the examples below.)

'suffix

Means a suffix or terminator grammar: insert the delimiter after every list element.

'prefix

Means a prefix grammar: insert the delimiter before every list element.
(string-join '("foo" "bar" "baz"))
         ⇒ "foo bar baz"
(string-join '("foo" "bar" "baz") "")
         ⇒ "foobarbaz"
(string-join '("foo" "bar" "baz") ":")
         ⇒ "foo:bar:baz"
(string-join '("foo" "bar" "baz") ":" 'suffix)
         ⇒ "foo:bar:baz:"

;; Infix grammar is ambiguous wrt empty list vs. empty string:
(string-join '()   ":") ⇒ ""
(string-join '("") ":") ⇒ ""

;; Suffix and prefix grammars are not:
(string-join '()   ":" 'suffix) ⇒ ""
(string-join '("") ":" 'suffix) ⇒ ":"
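
The prefix grammar is not shown above. A small sketch with illustrative path components, where the results follow from the grammar descriptions:

```scheme
(string-join '("usr" "local" "bin") "/" 'prefix)   ; ⇒ "/usr/local/bin"
(string-join '() "/" 'prefix)                      ; ⇒ ""
(string-join '("a" "b") ", " 'infix)               ; ⇒ "a, b"
```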

Procedure: string-replace string₁ string₂ start₁ end₁ [start₂ end₂]

Returns

(string-append (substring string₁ 0 start₁)
               (substring string₂ start₂ end₂)
               (substring string₁ end₁ (string-length string₁)))

That is, the segment of characters in string₁ from start₁ to end₁ is replaced by the segment of characters in string₂ from start₂ to end₂. If start₁=end₁, this simply splices the characters drawn from string₂ into string₁ at that position.

Examples:

(string-replace "The TCL programmer endured daily ridicule."
                 "another miserable perl drone" 4 7 8 22)
    ⇒ "The miserable perl programmer endured daily ridicule."

(string-replace "It's easy to code it up in Scheme." "lots of fun" 5 9)
    ⇒ "It's lots of fun to code it up in Scheme."

(define (string-insert s i t) (string-replace s t i i))

(string-insert "It's easy to code it up in Scheme." 5 "really ")
    ⇒ "It's really easy to code it up in Scheme."

(define (string-set s i c) (string-replace s (string c) i (+ i 1)))

(string-set "String-ref runs in O(n) time." 21 #\1)
    ⇒ "String-ref runs in O(1) time."

Also see string-append! and string-replace! for destructive changes to a mutable string.

Mapping and folding

Procedure: string-fold kons knil string [start end]

Procedure: string-fold-right kons knil string [start end]

These are the fundamental iterators for strings.

The string-fold procedure maps the kons procedure across the given string from left to right:

(... (kons string₂ (kons string₁ (kons string₀ knil))))

In other words, string-fold obeys the (tail) recursion

  (string-fold kons knil string start end)
= (string-fold kons (kons string_start knil) string start+1 end)

The string-fold-right procedure maps kons across the given string from right to left:

(kons string₀
      (... (kons string_end-3
                 (kons string_end-2
                       (kons string_end-1
                             knil)))))

obeying the (tail) recursion

  (string-fold-right kons knil string start end)
= (string-fold-right kons (kons string_end-1 knil) string start end-1)

Examples:

;;; Convert a string to a list of chars.
(string-fold-right cons '() string)

;;; Count the number of lower-case characters in a string.
(string-fold (lambda (c count)
                (if (char-lower-case? c)
                    (+ count 1)
                    count))
              0
              string)

The string-fold-right combinator is sometimes called a “catamorphism.”
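
The difference in direction between the two folds shows up clearly when kons is cons. The inputs here are illustrative:

```scheme
(string-fold cons '() "abc")         ; ⇒ (#\c #\b #\a)
(string-fold-right cons '() "abc")   ; ⇒ (#\a #\b #\c)

;; With start and end: fold over the characters "bc" only.
(string-fold cons '() "abcd" 1 3)    ; ⇒ (#\c #\b)
```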

Procedure: string-for-each proc string₁ string₂ …

Procedure: string-for-each proc string₁ [start [end]]

The strings must all have the same length. proc should accept as many arguments as there are strings.

The start-end variant is provided for compatibility with the SRFI-13 version. (In that case start and end count Unicode scalar values (character values), not Java 16-bit char values.)

The string-for-each procedure applies proc element–wise to the characters of the strings for its side effects, in order from the first characters to the last. proc is always called in the same dynamic environment as string-for-each itself.

Analogous to for-each.
(let ((v '()))
  (string-for-each
    (lambda (c) (set! v (cons (char->integer c) v)))
    "abcde")
   v)
  ⇒ (101 100 99 98 97)
Performance note: The compiler generates efficient code for string-for-each. If proc is a lambda expression, it is inlined.

Procedure: string-map proc string₁ string₂ …

The string-map procedure applies proc element-wise to the elements of the strings and returns a string of the results, in order. It is an error if proc does not accept as many arguments as there are strings, or return other than a single character or a string. If more than one string is given and not all strings have the same length, string-map terminates when the shortest string runs out. The dynamic order in which proc is applied to the elements of the strings is unspecified.
(string-map char-foldcase "AbdEgH")  ⇒ "abdegh"
(string-map
  (lambda (c) (integer->char (+ 1 (char->integer c))))
  "HAL")
        ⇒ "IBM"
(string-map
  (lambda (c k)
    ((if (eqv? k #\u) char-upcase char-downcase) c))
  "studlycaps xxx"
  "ululululul")
        ⇒ "StUdLyCaPs"
Traditionally the result of proc had to be a character, but Kawa (and SRFI-140) allows the result to be a string.

Performance note: The string-map procedure has not been optimized (mainly because it is not very useful): The characters are boxed, and the proc is not inlined even if it is a lambda expression.

Procedure: string-map-index proc string [start end]

Calls proc on each valid index of the specified substring, converts the results of those calls into strings, and returns the concatenation of those strings. It is an error for proc to return anything other than a character or string. The dynamic order in which proc is called on the indexes is unspecified, as is the dynamic order in which the coercions are performed. If any strings returned by proc are mutated after they have been returned and before the call to string-map-index has returned, then string-map-index returns a string with unspecified contents; the string-map-index procedure itself does not mutate those strings.
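
As a hedged sketch (word and both lambda expressions are made up for this example), proc receives an index, and may return either a character or a string:

```scheme
(define word "abc")   ; illustrative input

;; Reverse the string by reading the character at the mirrored index.
(string-map-index
  (lambda (i) (string-ref word (- (string-length word) i 1)))
  word)                                ; ⇒ "cba"

;; proc may also return strings, which are concatenated.
(string-map-index
  (lambda (i) (string-append (number->string i) ":"))
  word)                                ; ⇒ "0:1:2:"
```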

Procedure: string-for-each-index proc string [start end]

Calls proc on each valid index of the specified substring, in increasing order, discarding the results of those calls. This is simply a safe and correct way to loop over a substring.

Example:
(let ((txt "abcde")
      (v '()))
  (string-for-each-index
    (lambda (cur) (set! v (cons (char->integer (string-ref txt cur)) v)))
    txt)
  v) ⇒ (101 100 99 98 97)

Procedure: string-count string pred [start end]

Returns a count of the number of characters in the specified substring of string that satisfy the predicate pred.

Procedure: string-filter pred string [start end]

Procedure: string-remove pred string [start end]

Return an immutable string consisting of only selected characters, in order: string-filter selects only the characters that satisfy pred; string-remove selects only the characters that do not satisfy pred.
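
Note the argument order: string-count takes the string first, while string-filter and string-remove take the predicate first. A short sketch with illustrative inputs:

```scheme
(string-count "Hello World" char-upper-case?)   ; ⇒ 2
(string-filter char-alphabetic? "a1b2c3")       ; ⇒ "abc"
(string-remove char-alphabetic? "a1b2c3")       ; ⇒ "123"
```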

Replication & splitting

Procedure: string-repeat string-or-character len

Create an istring by repeating the first argument len times. If the first argument is a character, it is as if it were wrapped with the string constructor. We can define string-repeat in terms of the more general xsubstring procedure:
(define (string-repeat S N)
   (let ((T (if (char? S) (string S) S)))
     (xsubstring T 0 (* N (string-length T)))))
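
Typical calls, with inputs chosen for illustration:

```scheme
(string-repeat "ab" 3)   ; ⇒ "ababab"
(string-repeat #\x 4)    ; ⇒ "xxxx"
(string-repeat "ab" 0)   ; ⇒ ""
```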

Procedure: xsubstring string [from to [start end]]

This is an extended substring procedure that implements replicated copying of a substring. The string is a string; start and end are optional arguments that specify a substring of string, defaulting to 0 and the length of string. This substring is conceptually replicated both up and down the index space, in both the positive and negative directions. For example, if string is "abcdefg", start is 3, and end is 6, then we have the conceptual bidirectionally-infinite string
  ...  d  e  f  d  e  f  d  e  f  d  e  f  d  e  f  d  e  f  d ...
      -9 -8 -7 -6 -5 -4 -3 -2 -1  0 +1 +2 +3 +4 +5 +6 +7 +8 +9
xsubstring returns the substring of the string beginning at index from, and ending at to. It is an error if from is greater than to.

If from and to are missing they default to 0 and from+(end-start), respectively. This variant is a generalization of using substring, but unlike substring never shares substructures that would retain characters or sequences of characters that are substructures of its first argument or previously allocated objects.

You can use xsubstring to perform a variety of tasks:

To rotate a string left: (xsubstring "abcdef" 2 8) ⇒ "cdefab"

To rotate a string right: (xsubstring "abcdef" -2 4) ⇒ "efabcd"

To replicate a string: (xsubstring "abc" 0 7) ⇒ "abcabca"

Note that

The from/to arguments give a half-open range containing the characters from index from up to, but not including, index to.

The from/to indexes are not expressed in the index space of string. They refer instead to the replicated index space of the substring defined by string, start, and end.

It is an error if start=end, unless from=to, which is allowed as a special case.
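
The "abcdefg" example above, with start 3 and end 6, can be written out as follows. The from/to values are chosen for illustration and read off the replicated index space shown in the diagram:

```scheme
;; The substring "def" (start 3, end 6) is replicated in both directions.
(xsubstring "abcdefg" 0 7 3 6)    ; ⇒ "defdefd"
(xsubstring "abcdefg" -2 5 3 6)   ; ⇒ "efdefde"
```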

Procedure: string-split string delimiter [grammar limit start end]

Returns a list of strings representing the words contained in the substring of string from start (inclusive) to end (exclusive). The delimiter is a string to be used as the word separator. This will often be a single character, but multiple characters are allowed for use cases such as splitting on "\r\n". The returned list will have one more item than the number of non-overlapping occurrences of the delimiter in the string. If delimiter is an empty string, then the returned list consists of strings that each contain a single character.

The grammar is a symbol with the same meaning as in the string-join procedure. If it is infix, which is the default, processing is done as described above, except an empty string produces the empty list; if grammar is strict-infix, then an empty string signals an error. The values prefix and suffix cause a leading/trailing empty string in the result to be suppressed.

If limit is a non-negative exact integer, at most that many splits occur, and the remainder of string is returned as the final element of the list (so the result will have at most limit+1 elements). If limit is not specified or is #f, then as many splits as possible are made. It is an error if limit is any other value.
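
A sketch of the behavior described above, using made-up inputs. The results are what the description implies and have not been checked against a particular Kawa release:

```scheme
(string-split "a,b,c" ",")            ; ⇒ ("a" "b" "c")
(string-split "a,,c" ",")             ; ⇒ ("a" "" "c")
(string-split "a,b,c" "," 'infix 1)   ; ⇒ ("a" "b,c") (at most one split)
(string-split "a,b," "," 'suffix)     ; ⇒ ("a" "b") (trailing "" suppressed)
(string-split "" ",")                 ; ⇒ ()
```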

To split on a regular expression, you can use SRFI 115’s regexp-split procedure.

String mutation

The following procedures create a mutable string, i.e. one that you can modify.

Procedure: make-string [k [char]]

Return a newly allocated mstring of k characters, where k defaults to 0. If char is given, then all elements of the string are initialized to char, otherwise the contents of the string are unspecified.

The 1-argument version is deprecated as poor style, except when k is 0.

Rationale: In many languages the most common pattern for mutable strings is to allocate an empty string and incrementally append to it. It seems natural to initialize the string with (make-string), rather than (make-string 0).

To return an immutable string that repeats a character char k times, use string-repeat.

This is as R7RS, except the result is variable-size and we allow leaving out k when it is zero.

Procedure: string-copy string [start [end]]

Returns a newly allocated mutable (mstring) copy of the part of the given string between start and end.
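The following sketch shows the typical reason to call string-copy: obtaining a mutable copy of (part of) an immutable string literal. The variable names are hypothetical, and the results shown are those implied by the descriptions in this section.

```scheme
(define original "hello world")            ; a literal, hence an immutable istring
(define part (string-copy original 0 5))   ; mutable copy of "hello"

(string-append! part "!")                  ; modifies only the copy
part                                       ; ⇒ "hello!"
original                                   ; ⇒ "hello world"
```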

The following procedures modify a mutable string.

Procedure: string-set! string k char

This procedure stores char in element k of string.
(define s1 (make-string 3 #\*))
(define s2 "***")
(string-set! s1 0 #\?) ⇒ void
s1 ⇒ "?**"
(string-set! s2 0 #\?) ⇒ error
(string-set! (symbol->string 'immutable) 0 #\?) ⇒ error
Performance note: Calling string-set! may take time proportional to the length of the string: First it must scan for the right position, like string-ref does. Then if the new character requires using a surrogate pair (and the old one doesn’t) then we have to make room in the string, possibly re-allocating a new char array. Alternatively, if the old character requires using a surrogate pair (and the new one doesn’t) then following characters need to be moved.

The function string-set! is deprecated: It is inefficient, and it very seldom does the correct thing. Instead, you can construct a string with string-append!.

Procedure: string-append! string value …

The string must be a mutable string, such as one returned by make-string or string-copy. The string-append! procedure extends string by appending each value (in order) to the end of string. Each value should be a character or a string.

Performance note: The compiler converts a call with multiple values to multiple string-append! calls. If a value is known to be a character, then no boxing (object-allocation) is needed.

The following example shows how to efficiently process a string using string-for-each and incrementally “build” a result string using string-append!.
(define (translate-space-to-newline str::string)::string
  (let ((result (make-string 0)))
    (string-for-each
     (lambda (ch)
       (string-append! result
                       (if (char=? ch #\space) #\newline ch)))
     str)
    result))

Procedure: string-copy! to at from [start [end]]

Copies the characters of the string from that are between start and end into the string to, starting at index at. The order in which characters are copied is unspecified, except that if the source and destination overlap, copying takes place as if the source is first copied into a temporary string and then into the destination. (This is achieved without allocating storage by making sure to copy in the correct direction in such circumstances.)

This is equivalent to (and implemented as):
(string-replace! to at (+ at (- end start)) from start end)
(define a "12345")
(define b (string-copy "abcde"))
(string-copy! b 1 a 0 2)
b  ⇒  "a12de"

Procedure: string-replace! dst dst-start dst-end src [src-start [src-end]]

Replaces the characters of string dst (between dst-start and dst-end) with the characters of src (between src-start and src-end). The number of characters from src may be different than the number replaced in dst, so the string may grow or contract. The special case where dst-start is equal to dst-end corresponds to insertion; the case where src-start is equal to src-end corresponds to deletion. The order in which characters are copied is unspecified, except that if the source and destination overlap, copying takes place as if the source is first copied into a temporary string and then into the destination. (This is achieved without allocating storage by making sure to copy in the correct direction in such circumstances.)
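A short sketch of the three cases described above (replacement, insertion, and deletion), applied in sequence to one mutable string. The variable name s is hypothetical, and the comments give the contents that the description implies after each step.

```scheme
(define s (string-copy "abcdef"))      ; string-copy returns an mstring

;; Replace "bc" (indexes 1 to 3) with a longer string: s grows.
(string-replace! s 1 3 "XYZ")          ; s is now "aXYZdef"

;; dst-start equal to dst-end: pure insertion at index 0.
(string-replace! s 0 0 ">>")           ; s is now ">>aXYZdef"

;; Empty source range: pure deletion of the character at index 2.
(string-replace! s 2 3 "")             ; s is now ">>XYZdef"

;; Use only part of src: characters 5 and 6 of "0123456".
(string-replace! s 0 2 "0123456" 5 7)  ; s is now "56XYZdef"
```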

Procedure: string-fill! string fill [start [end]]

The string-fill! procedure stores fill in the elements of string between start and end. It is an error if fill is not a character or is forbidden in strings.
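For example, a minimal sketch with a hypothetical variable s, showing a partial fill followed by a fill of the whole string:

```scheme
(define s (make-string 5 #\-))   ; mutable string "-----"

(string-fill! s #\* 1 3)         ; fill indexes 1 and 2 only
s                                ; ⇒ "-**--"

(string-fill! s #\.)             ; start and end omitted: fill everything
s                                ; ⇒ "....."
```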

Strings as sequences

Indexing a string

Using function-call syntax with strings is convenient and efficient. However, it has some “gotchas”.

We will use the following example string:

(! str1 "Smile \x1f603;!")

or if you’re brave:

(! str1 "Smile 😃!")

This is "Smile " followed by an emoticon (“smiling face with open mouth”) followed by "!". The emoticon has scalar value \x1f603. It is not in the 16-bit Basic Multilingual Plane, and so it must be encoded by a surrogate pair (#\xd83d followed by #\xde03).

The number of scalar values (characters) is 8, while the number of 16-bit code units (chars) is 9. The java.lang.CharSequence:length method counts chars. Both the length and the string-length procedures count characters. Thus:

(length str1)          ⇒ 8
(string-length str1)   ⇒ 8
(str1:length)          ⇒ 9

Counting chars is a constant-time operation (since the count is stored in the data structure). Counting characters depends on the representation used: In general it may take time proportional to the length of the string, since it has to subtract one for each surrogate pair; however the istring type (gnu.lists.IString class) uses an extra structure so it can count characters in constant time.

Similarly, we can index the string in 3 ways:

(str1 1)              ⇒ #\m :: character
(string-ref str1 1)   ⇒ #\m :: character
(str1:charAt 1)       ⇒ #\m :: char

Using function-call syntax when the “function” is a string and a single integer argument is the same as using string-ref.

Things become interesting when we reach the emoticon:

(str1 6)              ⇒ #\😃 :: character
(str1:charAt 6)       ⇒ #\xd83d :: char

Both string-ref and the function-call syntax return the real character, while the charAt method returns a partial character.

(str1 7)              ⇒ #\! :: character
(str1:charAt 7)       ⇒ #\xde03 :: char
(str1 8)              ⇒ throws StringIndexOutOfBoundsException
(str1:charAt 8)       ⇒ #\! :: char

Indexing with a sequence

You can index a string with a list of integer indexes, most commonly a range:

(str [i ...])

is basically the same as:

(string (str i) ...)

Generally when working with strings it is best to work with substrings rather than individual characters:

(str [start <: end])

This is equivalent to invoking the substring procedure:

(substring str start end)
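A small sketch of both forms, using a made-up ASCII string so that character and char indexes coincide. The results are those implied by the equivalences stated above.

```scheme
(! str "abcdef")

;; A range selects a substring, like (substring str 1 4).
(str [1 <: 4])        ; ⇒ "bcd"
(substring str 1 4)   ; ⇒ "bcd"

;; An explicit index sequence picks characters in the given order,
;; like (string (str 5) (str 3) (str 1)).
(str [5 3 1])         ; ⇒ "fdb"
```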

String Cursor API

Indexing into a string (using for example string-ref) is inefficient because of the possible presence of surrogate pairs. Hence, given an index i, access normally requires linearly scanning the string until we have seen i characters.

The string-cursor API is defined in terms of abstract “cursor values”, which point to a position in the string. This avoids the linear scan.

Typical usage is:

(let* ((str whatever)
       (end (string-cursor-end str)))
  (do ((sc::string-cursor (string-cursor-start str)
                          (string-cursor-next str sc)))
    ((string-cursor>=? sc end))
    (let ((ch (string-cursor-ref str sc)))
      (do-something-with ch))))

Alternatively, the following may be marginally faster:

(let* ((str whatever)
       (end (string-cursor-end str)))
  (do ((sc::string-cursor (string-cursor-start str)
                          (string-cursor-next-quick sc)))
    ((string-cursor>=? sc end))
    (let ((ch (string-cursor-ref str sc)))
      (if (not (char=? ch #\ignorable-char))
        (do-something-with ch)))))

The API is non-standard, but is based on that in Chibi Scheme.
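As a concrete instance of the first pattern, here is a sketch of a procedure that counts the occurrences of a character in one pass over the string. The name count-char and its parameters are hypothetical; only the string-cursor procedures come from this section.

```scheme
(define (count-char str::string target)
  (let ((end (string-cursor-end str)))
    ;; sc walks the string one character at a time; n accumulates matches.
    ;; Both step expressions see the bindings of the current iteration,
    ;; so n is updated using the character at the current cursor.
    (do ((sc::string-cursor (string-cursor-start str)
                            (string-cursor-next str sc))
         (n 0 (if (char=? (string-cursor-ref str sc) target)
                  (+ n 1)
                  n)))
        ((string-cursor>=? sc end) n))))

(count-char "a b c" #\space)   ; ⇒ 2
```

Because string-cursor-next steps over a whole surrogate pair, the loop never sees #\ignorable-char and needs no extra test for it.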

Type: string-cursor

An abstract position (index) in a string. Implemented as a primitive int which counts the number of preceding code units (16-bit char values).

Procedure: string-cursor-start str

Returns a cursor for the start of the string. The result is always 0, cast to a string-cursor.

Procedure: string-cursor-end str

Returns a cursor for the end of the string, that is, one past the last valid character. Implemented as (as string-cursor (invoke str 'length)).

Procedure: string-cursor-ref str cursor

Return the character at the cursor. If the cursor points to the second char of a surrogate pair, returns #\ignorable-char.

Procedure: string-cursor-next string cursor [count]

Return the cursor position count (default 1) character positions forwards beyond cursor. For each count this may add either 1 or 2 (if pointing at a surrogate pair) to the cursor.

Procedure: string-cursor-next-quick cursor

Increment cursor by one raw char position, even if cursor points to the start of a surrogate pair. (In that case the next string-cursor-ref will return #\ignorable-char.) Same as (+ cursor 1) but with the string-cursor type.

Procedure: string-cursor-prev string cursor [count]

Return the cursor position count (default 1) character positions backwards before cursor.

Procedure: substring-cursor string [start [end]]

Create a substring of the section of string between the cursors start and end.
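A sketch of how substring-cursor might be combined with string-cursor-next to drop the first n characters of a string without indexing by character count. The helper name drop-chars is hypothetical, and the sketch assumes the string has at least n characters.

```scheme
(define (drop-chars str::string n)
  (substring-cursor str
                    ;; advance n characters from the start
                    (string-cursor-next str (string-cursor-start str) n)
                    (string-cursor-end str)))

;; Skips the six characters of "Smile " and keeps the rest,
;; including the emoticon that occupies a surrogate pair.
(drop-chars "Smile \x1f603;!" 6)
```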

Procedure: string-cursor<? cursor1 cursor2

Procedure: string-cursor<=? cursor1 cursor2

Procedure: string-cursor=? cursor1 cursor2

Procedure: string-cursor>=? cursor1 cursor2

Procedure: string-cursor>? cursor1 cursor2

Tests whether the position of cursor1 is, respectively, before, before or the same as, the same as, after or the same as, or after that of cursor2.

Performance note: Implemented as the corresponding int comparison.

Procedure: string-cursor-for-each proc string [start [end]]

Apply the procedure proc to each character position in string between the cursors start and end.