Kawa: Regular expressions

Regular expressions

Kawa provides regular expressions, which are a convenient mechanism for matching a string against a pattern and optionally replacing the matching parts.

A regexp is a string that describes a pattern. A regexp matcher tries to match this pattern against (a portion of) another string, which we will call the text string. The text string is treated as raw text and not as a pattern.

Most of the characters in a regexp pattern are meant to match occurrences of themselves in the text string. Thus, the pattern “abc” matches a string that contains the characters “a”, “b”, “c” in succession.

In the regexp pattern, some characters act as metacharacters, and some character sequences act as metasequences. That is, they specify something other than their literal selves. For example, in the pattern “a.c”, the characters “a” and “c” do stand for themselves but the metacharacter “.” can match any character (other than newline). Therefore, the pattern “a.c” matches an “a”, followed by any character, followed by a “c”.

To match the character “.” itself, we escape it, i.e., precede it with a backslash “\”. The character sequence “\.” is thus a metasequence, since it doesn’t match itself but rather just “.”. So, to match “a” followed by a literal “.” followed by “c” we use the regexp pattern “a\.c”. To write this as a Scheme string literal, you need to quote the backslash, so you need to write "a\\.c". Kawa also allows the literal syntax #/a\.c/, which avoids the need to double the backslashes.
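As a small sketch of the difference between the metacharacter “.” and the escaped “\.”, the following uses regex-match, which is described later in this section. The text strings are made up for illustration, and the results shown are what the descriptions above imply.

```scheme
(import (kawa regex))

;; Unescaped ".": matches any single character between "a" and "c".
(regex-match "a.c" "abc")      ; expected: ("abc")
(regex-match "a.c" "a.c")      ; expected: ("a.c")

;; Escaped "\.": matches only a literal ".".
;; The backslash is doubled because this is a Scheme string literal.
(regex-match "a\\.c" "abc")    ; expected: #f
(regex-match "a\\.c" "a.c")    ; expected: ("a.c")

;; The same pattern in literal syntax, with no doubled backslash.
(regex-match #/a\.c/ "a.c")    ; expected: ("a.c")
```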

You can choose between two similar styles of regular expressions. The two differ slightly in terms of which characters act as metacharacters, and what those metacharacters mean:

Functions starting with regex- are implemented using the java.util.regex package. These are likely to be more efficient, have better Unicode support and some other minor extra features, and support the literal syntax #/a\.c/ mentioned above.
Functions starting with pregexp- are implemented in pure Scheme using Dorai Sitaram’s “Portable Regular Expressions for Scheme” library. These are portable to more Scheme implementations, including BRL, and are available on older Java versions.

Java regular expressions

The syntax for regular expressions is documented in the Java API documentation for the java.util.regex.Pattern class.

Type: regex

A compiled regular expression, implemented as java.util.regex.Pattern.

Constructor: regex arg

Given a regular expression pattern (as a string), compiles it to a regex object.
(regex "a\\.c")
This compiles into a pattern that matches an “a”, followed by a literal “.”, followed by a “c”.

The Scheme reader recognizes “#/” as the start of a regular expression pattern literal, which ends with the next un-escaped “/”. This has the big advantage that you don’t need to double the backslashes:

#/a\.c/

This is equivalent to (regex "a\\.c"), except it is compiled at read-time. If you need a literal “/” in a pattern, just escape it with a backslash: “#/a\/c/” matches an “a”, followed by a “/”, followed by a “c”.

You can add single-letter modifiers following the pattern literal. The following modifiers are allowed:

i

The modifier “i” causes the matching to ignore case. For example, the following pattern matches “a” or “A”.

#/a/i

m

Enables “multiline” mode. Normally the metacharacters “^” and “$” match only at the start and end of the entire input string. In multiline mode, “^” also matches just after a line terminator and “$” also matches just before one.

Multiline mode can also be enabled by the metasequence “(?m)”.

s

Enables “singleline” (aka “dot-all”) mode. In this mode the metacharacter “.” matches any character, including a line break. This mode can also be enabled by the metasequence “(?s)”.
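The modifiers can be compared side by side in a short sketch. The text strings here are made up for illustration, and the expected results follow from the descriptions above rather than from a recorded session. The functions used are described later in this section.

```scheme
(import (kawa regex))

;; "i": ignore case while matching.
(regex-match #/needle/  "hay NEEDLE stack")     ; expected: #f
(regex-match #/needle/i "hay NEEDLE stack")     ; expected: ("NEEDLE")

;; "m": "^" and "$" also match at line boundaries inside the text.
(regex-match #/^soup$/  "pea\nsoup")            ; expected: #f
(regex-match #/^soup$/m "pea\nsoup")            ; expected: ("soup")

;; "s": "." also matches a line break.
(regex-match-positions #/a.c/  "a\nc")          ; expected: #f
(regex-match-positions #/a.c/s "a\nc")          ; expected: ((0 . 3))

;; The same mode selected with a metasequence inside a pattern string.
(regex-match-positions "(?s)a.c" "a\nc")        ; expected: ((0 . 3))
```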

The following functions accept a regex either as a pattern string or as a compiled regex pattern. That is, the following are all equivalent:

(regex-match "b\\.c" "ab.cd")
(regex-match #/b\.c/ "ab.cd")
(regex-match (regex "b\\.c") "ab.cd")
(regex-match (java.util.regex.Pattern:compile "b\\.c") "ab.cd")

These all evaluate to the list ("b.c").

The following functions must be imported by doing one of:

(require 'regex) ;; or
(import (kawa regex))

Procedure: regex-match-positions regex string [start [end]]

The procedure regex‑match‑positions takes a regex pattern and a text string, and tries to match the pattern against (some part of) the text string.

Returns #f if the regexp did not match the string, and a list of index pairs if it did match.
(regex-match-positions "brain" "bird") ⇒ #f
(regex-match-positions "needle" "hay needle stack")
  ⇒ ((4 . 10))
In the second example, the integers 4 and 10 identify the substring that was matched. 4 is the starting (inclusive) index and 10 the ending (exclusive) index of the matching substring.
(substring "hay needle stack" 4 10) ⇒ "needle"
In this case the return list contains only one index pair, and that pair represents the entire substring matched by the regexp. When we discuss subpatterns later, we will see how a single match operation can yield a list of submatches.

regex‑match‑positions takes optional third and fourth arguments that specify the indices of the text string within which the matching should take place.
(regex-match-positions "needle"
  "his hay needle stack -- my hay needle stack -- her hay needle stack"
  24 43)
  ⇒ ((31 . 37))
Note that the returned indices are still reckoned relative to the full text string.

Procedure: regex-match regex string [start [end]]

The procedure regex‑match is called like regex‑match‑positions but instead of returning index pairs it returns the matching substrings:
(regex-match "brain" "bird") ⇒ #f
(regex-match "needle" "hay needle stack")
  ⇒ ("needle")
regex‑match also takes optional third and fourth arguments, with the same meaning as for regex‑match‑positions.
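The mention of submatches above can be made concrete with a pattern that contains parenthesized subpatterns. This is a sketch only: the text string is invented, and it assumes that each subpattern contributes one additional entry after the entry for the whole match.

```scheme
(import (kawa regex))

;; Two subpatterns: the digits before and after the hyphen.
(regex-match #/(\d+)-(\d+)/ "tel 555-1234")
;; expected: ("555-1234" "555" "1234")

;; The same match reported as (start . end) index pairs.
(regex-match-positions #/(\d+)-(\d+)/ "tel 555-1234")
;; expected: ((4 . 12) (4 . 7) (8 . 12))

;; Each pair can be fed to substring to recover the matched text.
(substring "tel 555-1234" 8 12)
;; expected: "1234"
```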

Procedure: regex-split regex string

Takes two arguments, a regex pattern and a text string, and returns a list of substrings of the text string, where the pattern identifies the delimiter separating the substrings.
(regex-split ":" "/bin:/usr/bin:/usr/bin/X11:/usr/local/bin")
  ⇒ ("/bin" "/usr/bin" "/usr/bin/X11" "/usr/local/bin")

(regex-split " " "pea soup")
  ⇒ ("pea" "soup")
If the first argument can match an empty string, then the list of all the single-character substrings is returned, plus an empty string at each end.
(regex-split "" "smithereens")
  ⇒ ("" "s" "m" "i" "t" "h" "e" "r" "e" "e" "n" "s" "")
(Note: This behavior is different from pregexp-split.)

To identify one-or-more spaces as the delimiter, take care to use the regexp “ +”, not “ *”.
(regex-split " +" "split pea     soup")
  ⇒ ("split" "pea" "soup")
(regex-split " *" "split pea     soup")
  ⇒ ("" "s" "p" "l" "i" "t" "" "p" "e" "a" "" "s" "o" "u" "p" "")
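Delimiters that are longer than one character work the same way, as long as the pattern cannot match an empty string. The following sketch uses invented input strings to show a comma delimiter with optional surrounding whitespace and a delimiter made of digits.

```scheme
(import (kawa regex))

;; A comma with optional whitespace on either side.
;; The comma is mandatory, so the pattern never matches an empty string.
(regex-split #/\s*,\s*/ "a, b ,c")
;; expected: ("a" "b" "c")

;; One or more digits as the delimiter.
(regex-split #/\d+/ "ab12cd345ef")
;; expected: ("ab" "cd" "ef")
```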

Procedure: regex‑replace regex string replacement

Replaces the first matched portion of the text string with a replacement string.
(regex-replace "te" "liberte" "ty")
  ⇒ "liberty"
Submatches can be used in the replacement string argument. The replacement string can use “$n” as a backreference to refer back to the nth submatch, i.e., the substring that matched the nth subpattern. “$0” refers to the entire match.
(regex-replace #/_(.+?)_/
               "the _nina_, the _pinta_, and the _santa maria_"
               "*$1*")
  ⇒ "the *nina*, the _pinta_, and the _santa maria_"

Procedure: regex‑replace* regex string replacement

Replaces all matches in the text string by the replacement string:

(regex-replace* "te" "liberte egalite fraternite" "ty")
  ⇒ "liberty egality fratyrnity"
(regex-replace* #/_(.+?)_/
                "the _nina_, the _pinta_, and the _santa maria_"
                "*$1*")
  ⇒ "the *nina*, the *pinta*, and the *santa maria*"
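Backreferences are not limited to “$1”. The sketch below, with made-up input strings, swaps two submatches with regex‑replace* and wraps the entire match using “$0” with regex‑replace.

```scheme
(import (kawa regex))

;; Swap the two numbers around each hyphen: $1 and $2 are the submatches.
(regex-replace* #/(\d+)-(\d+)/ "10-20 and 30-40" "$2-$1")
;; expected: "20-10 and 40-30"

;; $0 stands for the entire match; only the first match is replaced.
(regex-replace #/\d+/ "room 42, floor 7" "<$0>")
;; expected: "room <42>, floor 7"
```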

Procedure: regex-quote pattern

Takes an arbitrary string and returns a pattern string that precisely matches it. In particular, characters in the input string that could serve as regex metacharacters are escaped as needed.
(regex-quote "cons")
  ⇒ "\Qcons\E"
regex‑quote is useful when building a composite regex from a mix of regex strings and verbatim strings.
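As a sketch of building such a composite regex, the following anchors a verbatim string between “^” and “$”. The names verbatim and composite are hypothetical, and the example assumes that the result of string-append is accepted wherever a pattern string is.

```scheme
(import (kawa regex))

;; A verbatim string that happens to contain a regex metacharacter.
(define verbatim "a.c")

;; Composite pattern: regex syntax ("^", "$") around the quoted verbatim part.
(define composite (string-append "^" (regex-quote verbatim) "$"))

(regex-match composite "a.c")    ; expected: ("a.c")
(regex-match composite "abc")    ; expected: #f

;; Without quoting, "." acts as a metacharacter and "abc" matches too.
(regex-match (string-append "^" verbatim "$") "abc")    ; expected: ("abc")
```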

Portable Scheme regular expressions

The pregexp library provides the procedures pregexp, pregexp‑match‑positions, pregexp‑match, pregexp‑split, pregexp‑replace, pregexp‑replace*, and pregexp‑quote.

Before using them, you must require them:

(require 'pregexp)

These procedures have the same interface as the corresponding regex- versions, but take slightly different pattern syntax. The replace commands use “\” instead of “$” to indicate substitutions. Also, pregexp‑split behaves differently from regex‑split if the pattern can match an empty string.
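The two differences noted above can be sketched as follows. The inputs mirror the regex- examples earlier on this page, and the expected results are those given for the same calls in the pregexp library’s own documentation.

```scheme
(require 'pregexp)

;; Backreferences in the replacement use "\" rather than "$".
;; The backslash is doubled because this is a Scheme string literal.
(pregexp-replace* "_(.+?)_"
                  "the _nina_, the _pinta_, and the _santa maria_"
                  "*\\1*")
;; expected: "the *nina*, the *pinta*, and the *santa maria*"

;; Splitting on a pattern that matches the empty string:
;; no empty strings are added at the ends, unlike regex-split.
(pregexp-split "" "smithereens")
;; expected: ("s" "m" "i" "t" "h" "e" "r" "e" "e" "n" "s")
```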

See the documentation for Dorai Sitaram’s pregexp library for details.