The matcher language is a declarative language for specifying a matcher procedure. A matcher procedure is a procedure that accepts a single parser-buffer argument and returns a boolean value indicating whether the match it performs was successful. If the match succeeds, the internal pointer of the parser buffer is moved forward over the matched text. If the match fails, the internal pointer is unchanged.
For example, here is a matcher procedure that matches the character ‘a’:
(lambda (b) (match-parser-buffer-char b #\a))
Here is another example that matches two given characters, c1 and c2, in sequence:
(lambda (b)
  (let ((p (get-parser-buffer-pointer b)))
    (if (match-parser-buffer-char b c1)
        (if (match-parser-buffer-char b c2)
            #t
            (begin
              (set-parser-buffer-pointer! b p)
              #f))
        #f)))
This code is clear, but has lots of details that get in the way of understanding what it is doing. Here is the same example in the matcher language:
(*matcher (seq (char c1) (char c2)))
This is much simpler and more intuitive. And it generates virtually the same code:
(pp (*matcher (seq (char c1) (char c2))))
-| (lambda (#[b1])
-|   (let ((#[p1] (get-parser-buffer-pointer #[b1])))
-|     (and (match-parser-buffer-char #[b1] c1)
-|          (if (match-parser-buffer-char #[b1] c2)
-|              #t
-|              (begin
-|                (set-parser-buffer-pointer! #[b1] #[p1])
-|                #f)))))
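The contract described above (the pointer advances on success and is unchanged on failure) can be checked by running the generated matcher on a parser buffer. The following is a minimal sketch, assuming MIT/GNU Scheme with the *parser option loaded and evaluated form by form (at the REPL or with load). The names match-xy, ok-buffer, and bad-buffer are hypothetical, and the values chosen for c1 and c2 are arbitrary.

```scheme
;; Assumption: MIT/GNU Scheme; the *parser option supplies *matcher.
(load-option '*parser)

;; Arbitrary illustrative values for the two characters.
(define c1 #\x)
(define c2 #\y)
(define match-xy (*matcher (seq (char c1) (char c2))))

;; Success case: "xy" is consumed, so the next character is #\z.
(define ok-buffer (string->parser-buffer "xyz"))
(define ok-result (match-xy ok-buffer))
(define ok-next (peek-parser-buffer-char ok-buffer))

;; Failure case: c1 matches but c2 does not, so the pointer is
;; restored and the next character is still #\x.
(define bad-buffer (string->parser-buffer "xq"))
(define bad-result (match-xy bad-buffer))
(define bad-next (peek-parser-buffer-char bad-buffer))
```

Peeking at the next character is used here only as a convenient way to observe where the internal pointer ended up.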
Now that we have seen an example of the language, it’s time to look at
the detail. The
*matcher special form is the interface between
the matcher language and Scheme.
In the form (*matcher mexp), the operand mexp is an expression in the matcher language. The
*matcher expression expands into Scheme code that implements a matcher procedure.
Here are the predefined matcher expressions. New matcher expressions can be defined using the macro facility (see Parser-language Macros). We will start with the primitive expressions.
The char, char-ci, not-char, and not-char-ci expressions match a given character. In each case, the expression operand is a Scheme expression that must evaluate to a character at run time. The ‘-ci’ expressions do case-insensitive matching. The ‘not-’ expressions match any character other than the given one.
The string and string-ci expressions match a given string. The expression operand
is a Scheme expression that must evaluate to a string at run time. The
string-ci expression does case-insensitive matching.
These expressions match a single character that is a member of a given character set. The expression operand is a Scheme expression that must evaluate to a character set at run time.
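The primitive expressions for characters, strings, and character sets can be exercised as follows. This is a sketch rather than a definitive test suite, assuming MIT/GNU Scheme with the *parser option loaded. The helper matches? is hypothetical and simply wraps a string in a fresh parser buffer.

```scheme
;; Assumption: MIT/GNU Scheme; the *parser option supplies *matcher.
(load-option '*parser)

;; Hypothetical helper: run MATCHER against a fresh buffer over TEXT.
(define (matches? matcher text)
  (matcher (string->parser-buffer text)))

;; Case-insensitive character match: #\a also matches "A".
(define r-char-ci (matches? (*matcher (char-ci #\a)) "A"))

;; not-char matches any character other than the given one.
(define r-not-char-same (matches? (*matcher (not-char #\a)) "a"))
(define r-not-char-other (matches? (*matcher (not-char #\a)) "b"))

;; Case-insensitive string match.
(define r-string-ci (matches? (*matcher (string-ci "abc")) "ABC"))

;; Character-set membership.
(define r-digit (matches? (*matcher (char-set char-set:numeric)) "7"))
(define r-not-digit (matches? (*matcher (char-set char-set:numeric)) "x"))
```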
The end-of-input expression is successful only when there are
no more characters available to be matched.
The discard-matched expression always successfully matches the
null string. However, it isn’t meant to be used as a matching
expression; it is used for its effect. It causes
all of the buffered text prior to this point to be discarded (i.e. it calls
discard-parser-buffer-head! on the parser buffer).
Note that discard-matched may not be used in certain places in
a matcher expression. The reason for this is that it deliberately
discards information needed for backtracking, so it may not be used in
a place where subsequent backtracking will need to back over it. As a
rule of thumb, use discard-matched only in the last operand of a seq or
alt expression (including any seq or
alt expressions in which it is indirectly contained).
In addition to the above primitive expressions, there are two
convenient abbreviations. A character literal (e.g. ‘#\A’) is
a legal primitive expression, and is equivalent to a char
expression with that literal as its operand (e.g. ‘(char
#\A)’). Likewise, a string literal is equivalent to a string
expression (e.g. ‘(string "abc")’).
Next there are several combinator expressions. These closely correspond to similar combinators in regular expressions. Parameters named mexp are arbitrary expressions in the matcher language.
The seq expression matches each mexp operand in sequence. For example,
(seq (char-set char-set:alphabetic) (char-set char-set:numeric))
matches an alphabetic character followed by a numeric character, such as ‘H4’.
Note that if there are no mexp operands, the seq
expression successfully matches the null string.
The alt expression attempts to match each mexp operand in order from left to
right. The first one that successfully matches becomes the match for
the entire alt expression. The alt expression participates in backtracking. If one of the
mexp operands matches, but the overall match in which this
expression is embedded fails, the backtracking mechanism will cause
the alt expression to try the remaining mexp operands.
For example, if the expression
(seq (alt "ab" "a") "b")
is matched against the text ‘abc’, the
alt expression will
initially match its first operand. But it will then fail to match the
second operand of the
seq expression. This will cause the
alt to be restarted, at which time it will match ‘a’, and
the overall match will succeed.
Note that if there are no mexp operands, the alt expression
will always fail.
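The backtracking example above can be run directly. This is a minimal sketch, assuming MIT/GNU Scheme with the *parser option loaded. The names match-ab-b, alt-buffer, alt-result, and alt-next are hypothetical.

```scheme
;; Assumption: MIT/GNU Scheme; the *parser option supplies *matcher.
(load-option '*parser)

(define match-ab-b (*matcher (seq (alt "ab" "a") "b")))

;; Against "abc", alt first matches "ab"; the following "b" then
;; fails on #\c, so alt is restarted and matches "a" instead.
(define alt-buffer (string->parser-buffer "abc"))
(define alt-result (match-ab-b alt-buffer))

;; After the successful match of "a" then "b", #\c is next.
(define alt-next (peek-parser-buffer-char alt-buffer))
```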
The * expression matches zero or more occurrences of the mexp operand. (Consequently this match always succeeds.)
The * expression participates in backtracking; if it matches
N occurrences of mexp, but the overall match fails, it
will backtrack to N-1 occurrences and continue. If the overall
match continues to fail, the * expression will continue to
backtrack until there are no occurrences left.
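The following sketch illustrates both properties of the * expression: it succeeds on zero occurrences, and it gives back occurrences when the rest of the match needs them. It assumes MIT/GNU Scheme with the *parser option loaded. The helper try-match is hypothetical; it returns a list holding the match result and the next unread character (#f at end of input).

```scheme
;; Assumption: MIT/GNU Scheme; the *parser option supplies *matcher.
(load-option '*parser)

;; Hypothetical helper: returns (matched? next-char-or-#f).
(define (try-match matcher text)
  (let* ((buffer (string->parser-buffer text))
         (matched? (matcher buffer)))
    (list matched? (peek-parser-buffer-char buffer))))

;; Zero occurrences: succeeds without moving the pointer.
(define star-zero (try-match (*matcher (* (char #\a))) "xyz"))

;; Backtracking: * first takes all three #\a characters, after
;; which "ab" cannot match; it backs off to two occurrences so
;; that "ab" can match the remaining text.
(define star-backtrack
  (try-match (*matcher (seq (* (char #\a)) "ab")) "aaab"))
```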
The + expression matches one or more occurrences of the mexp operand. It is equivalent to
(seq mexp (* mexp))
The ? expression matches zero or one occurrence of the mexp operand. It is equivalent to
(alt mexp (seq))
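As an illustration of combining the optional and one-or-more combinators, here is a sketch of a matcher for an optionally signed run of digits. It assumes MIT/GNU Scheme with the *parser option loaded. The names try-match and match-signed-digits are hypothetical.

```scheme
;; Assumption: MIT/GNU Scheme; the *parser option supplies *matcher.
(load-option '*parser)

;; Hypothetical helper: returns (matched? next-char-or-#f).
(define (try-match matcher text)
  (let* ((buffer (string->parser-buffer text))
         (matched? (matcher buffer)))
    (list matched? (peek-parser-buffer-char buffer))))

;; An optional sign followed by one or more numeric characters.
(define match-signed-digits
  (*matcher
   (seq (? (alt #\+ #\-))
        (+ (char-set char-set:numeric)))))

(define signed-ok (try-match match-signed-digits "-42;"))
(define unsigned-ok (try-match match-signed-digits "7"))
;; A sign with no digits fails, and the pointer stays at the sign.
(define sign-only (try-match match-signed-digits "+x"))
```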
The sexp expression allows arbitrary Scheme code to be embedded
inside a matcher. The expression operand must evaluate to a
matcher procedure at run time; the procedure is called to match the
parser buffer. For example,
(*matcher (seq "a" (sexp parse-foo) "b"))
expands to
(lambda (#[b1])
  (let ((#[p1] (get-parser-buffer-pointer #[b1])))
    (and (match-parser-buffer-char #[b1] #\a)
         (if (parse-foo #[b1])
             (if (match-parser-buffer-char #[b1] #\b)
                 #t
                 (begin
                   (set-parser-buffer-pointer! #[b1] #[p1])
                   #f))
             (begin
               (set-parser-buffer-pointer! #[b1] #[p1])
               #f)))))
The case in which expression is a symbol is so common that it has an abbreviation: ‘(sexp symbol)’ may be abbreviated as just symbol.
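A hand-written matcher procedure can be embedded either with sexp or with the symbol abbreviation. The sketch below assumes MIT/GNU Scheme with the *parser option loaded. The procedure match-vowel and the other names are hypothetical; match-vowel follows the matcher contract by delegating to match-parser-buffer-char-in-set, which leaves the pointer alone on failure.

```scheme
;; Assumption: MIT/GNU Scheme; the *parser option supplies *matcher.
(load-option '*parser)

;; Hypothetical hand-written matcher procedure: one lowercase vowel.
(define vowels (string->char-set "aeiou"))
(define (match-vowel buffer)
  (match-parser-buffer-char-in-set buffer vowels))

;; Explicit sexp form and the symbol abbreviation side by side.
(define match-explicit (*matcher (seq "b" (sexp match-vowel) "t")))
(define match-abbrev (*matcher (seq "b" match-vowel "t")))

(define explicit-ok (match-explicit (string->parser-buffer "bat")))
(define abbrev-ok (match-abbrev (string->parser-buffer "bit")))

;; "bxt" fails at the vowel; the pointer returns to the start.
(define fail-buffer (string->parser-buffer "bxt"))
(define fail-result (match-abbrev fail-buffer))
(define fail-next (peek-parser-buffer-char fail-buffer))
```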
The with-pointer expression fetches the parser buffer’s
internal pointer (using
get-parser-buffer-pointer), binds it to
identifier, and then matches the pattern specified by
mexp. Identifier must be a symbol.
This is meant to be used in conjunction with
sexp, as a way to
capture a pointer to a part of the input stream that is outside the
sexp expression. An example of the use of with-pointer
appears above (see with-pointer example).
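As a further illustration, the sketch below uses with-pointer together with sexp to record the text matched before the sexp is reached. It assumes MIT/GNU Scheme with the *parser option loaded, and it relies on get-parser-buffer-tail returning the characters between a saved pointer and the buffer's current position. The names captured, start, and match-word are hypothetical.

```scheme
;; Assumption: MIT/GNU Scheme; the *parser option supplies *matcher.
(load-option '*parser)

;; Hypothetical variable that receives the matched text.
(define captured #f)

(define match-word
  (*matcher
   (with-pointer start
     (seq (+ (char-set char-set:alphabetic))
          ;; The embedded procedure sees START, which was bound
          ;; before the alphabetic characters were matched.
          (sexp (lambda (buffer)
                  (set! captured
                        (get-parser-buffer-tail buffer start))
                  #t))))))

(define word-buffer (string->parser-buffer "hello world"))
(define word-result (match-word word-buffer))
(define word-next (peek-parser-buffer-char word-buffer))
```

The embedded procedure always returns #t, so it never causes the enclosing match to fail; it is used only for its side effect.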