PEG Internals (Guile Reference Manual)

Previous: PEG Tutorial, Up: PEG Parsing [Contents][Index]

6.15.4 PEG Internals

A PEG parser takes a string as input and attempts to parse it as a given nonterminal. The key idea of the PEG implementation is that every nonterminal is just a function that takes a string as an argument and attempts to parse that string as its nonterminal. The functions always start from the beginning, but a parse is considered successful if there is material left over at the end.

This makes it easy to model different PEG parsing operations. For instance, consider the PEG grammar "ab", which could also be written (and "a" "b"). It matches the string “ab”. Here’s how that might be implemented in the PEG style:

(define (match-and-a-b str)
  (match-a str)
  (match-b str))

As you can see, the use of functions provides an easy way to model sequencing. In a similar way, one could model (or a b) with something like the following:

(define (match-or-a-b str)
  (or (match-a str) (match-b str)))

Here the semantics of a PEG or expression map naturally onto Scheme’s or operator. This function will attempt to run (match-a str), and return its result if it succeeds. Otherwise it will run (match-b str).

Of course, the code above wouldn’t quite work. We need some way for the parsing functions to communicate. The actual interface used is below.

Parsing Function Interface

A parsing function takes three arguments - a string, the length of that string, and the position in that string it should start parsing at. In effect, the parsing functions pass around substrings in pieces - the first argument is a buffer of characters, and the second two give a range within that buffer that the parsing function should look at.

Parsing functions return either #f, if they failed to match their nonterminal, or a list whose first element must be an integer representing the final position in the string they matched and whose cdr can be any other data the function wishes to return, or ’() if it doesn’t have any more data.

The one caveat is that if the extra data it returns is a list, any adjacent strings in that list will be appended by match-pattern. For instance, if a parsing function returns (13 ("a" "b" "c")), match-pattern will take (13 ("abc")) as its value.

For example, here is a function to match “ab” using the actual interface.

(define (match-a-b str len pos)
   (and (<= (+ pos 2) len)
        (string= str "ab" pos (+ pos 2))
        (list (+ pos 2) '()))) ; we return no extra information

The above function can be used to match a string by running (match-pattern match-a-b "ab").

Code Generators and Extensible Syntax

PEG expressions, such as those in a define-peg-pattern form, are interpreted internally in two steps.

First, any string PEG is expanded into an s-expression PEG by the code in the (ice-9 peg string-peg) module.

Then, the s-expression PEG that results is compiled into a parsing function by the (ice-9 peg codegen) module. In particular, the function compile-peg-pattern is called on the s-expression. It then decides what to do based on the form it is passed.

The PEG syntax can be expanded by providing compile-peg-pattern more options for what to do with its forms. The extended syntax will be associated with a symbol, for instance my-parsing-form, and will be called on all PEG expressions of the form

(my-parsing-form ...)

The parsing function should take two arguments. The first will be a syntax object containing a list with all of the arguments to the form (but not the form’s name), and the second will be the capture-type argument that is passed to define-peg-pattern.

New functions can be registered by calling (add-peg-compiler! symbol function), where symbol is the symbol that will indicate a form of this type and function is the code generating function described above. The function add-peg-compiler! is exported from the (ice-9 peg codegen) module.