Parser Language (MIT/GNU Scheme 12.1)

14.14 Parser Language

Although it is possible to write parsers using the parser-buffer abstraction (see Parser Buffers), it is tedious. The problem is that the abstraction isn’t closely matched to the way that people think about syntactic structures. In this section, we introduce a higher-level mechanism that greatly simplifies the implementation of a parser.

The parser language described here allows the programmer to write BNF-like specifications that are translated into efficient Scheme code at compile time. The language is declarative, but it can be freely mixed with Scheme code; this allows the parsing of grammars that aren’t conveniently described in the language.

The language also provides backtracking. For example, this expression matches any sequence of alphanumeric characters followed by a single alphabetic character:

(*matcher
 (seq (* (char-set char-set:alphanumeric))
      (char-set char-set:alphabetic)))

The way that this works is that the matcher matches alphanumeric characters in the input stream until it finds a non-alphanumeric character. It then tries to match an alphabetic character, which of course fails. At this point, if it matched at least one alphanumeric character, it backtracks: the last matched alphanumeric is “unmatched”, and it again attempts to match an alphabetic character. The backtracking can be arbitrarily deep; the matcher will continue to back up until it finds a way to match the remainder of the expression.

So far, this sounds a lot like regular-expression matching (see Regular Expressions). However, there are some important differences.

The parser language uses a Scheme-like syntax that is easier to read and write than regular-expression notation.
The language provides macros so that common syntactic constructs can be abstracted.
The language mixes easily with Scheme code, allowing the full power of Scheme to be applied to program around limitations in the parser language.
The language provides expressive facilities for converting syntax into parsed structure. It also makes it easy to convert parsed strings into meaningful objects (e.g. numbers).
The language is compiled into machine language; regular expressions are usually interpreted.

Here is an example that shows off several of the features of the parser language. The example is a parser for XML start tags:

(*parser
 (with-pointer p
   (seq "<"
        parse-name
        parse-attribute-list
        (alt (match ">")
             (match "/>")
             (sexp
              (lambda (b)
                (error
                 (string-append
                  "Unterminated start tag at "
                  (parser-buffer-position-string p)))))))))

This shows that the basic description of a start tag is very similar to its BNF. Non-terminal symbols parse-name and parse-attribute-list do most of the work, and the noise strings "<" and ">" are the syntactic markers delimiting the form. There are two alternate endings for start tags, and if the parser doesn’t find either of the endings, the Scheme code (wrapped in sexp) is run to signal an error. The error procedure perror takes a pointer p, which it uses to indicate the position in the input stream at which the error occurred. In this case, that is the beginning of the start tag, i.e. the position of the leading "<" marker.

This example still looks pretty complicated, mostly due to the error-signalling code. In practice, this is abstracted into a macro, after which the expression is quite succinct:

(*parser
 (bracket "start tag"
     (seq (noise (string "<")) parse-name)
     (match (alt (string ">") (string "/>")))
   parse-attribute-list))

The bracket macro captures the pattern of a bracketed item, and hides much of the detail.

The parser language actually consists of two languages: one for defining matchers, and one for defining parsers. The languages are intentionally very similar, and are meant to be used together. Each sub-language is described below in its own section.

The parser language is a run-time-loadable option; to use it, execute

(load-option '*parser)

once before compiling any code that uses the language.