Language Grammar (GNU Emacs Lisp Reference Manual)

38.1 Tree-sitter Language Grammar

Loading a language grammar

Tree-sitter relies on a language grammar to parse text in that language. In Emacs, a language grammar is represented by a symbol. For example, the C language grammar is represented by the symbol c, and c can be passed to tree-sitter functions as the language argument.

Tree-sitter language grammars are distributed as dynamic libraries. In order to use a language grammar in Emacs, you need to make sure that the dynamic library is installed on the system. Emacs looks for language grammars in several places, in the following order:

first, in the list of directories specified by the variable treesit-extra-load-path;
then, in the tree-sitter subdirectory of the directory specified by user-emacs-directory (see The Init File);
and finally, in the system’s default locations for dynamic libraries.

In each of these directories, Emacs looks for a file whose name ends in one of the extensions specified by the variable dynamic-library-suffixes.
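
As a minimal sketch of using the first of these locations, the following init-file fragment adds a directory to treesit-extra-load-path. It assumes an Emacs built with tree-sitter support, and the directory ~/.local/lib/tree-sitter is a hypothetical location chosen only for illustration.

```elisp
;; Assumes Emacs 29 or later built with tree-sitter support.
;; "~/.local/lib/tree-sitter" is a hypothetical directory used only for
;; illustration; it would hold files such as libtree-sitter-c.so.
(add-to-list 'treesit-extra-load-path
             (expand-file-name "~/.local/lib/tree-sitter"))
```

Directories listed in this variable are searched before the tree-sitter subdirectory of user-emacs-directory and before the system locations.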

If Emacs cannot find the library or has problems loading it, Emacs signals the treesit-load-language-error error. The data of that signal could be one of the following:

(not-found error-msg …): This means that Emacs could not find the language grammar library.
(symbol-error error-msg): This means that Emacs could not find in the library the expected function that every language grammar library should export.
(version-mismatch error-msg): This means that the version of the language grammar library is incompatible with that of the tree-sitter library.

In all of these cases, error-msg might provide additional details about the failure.

Function: treesit-language-available-p language &optional detail

This function returns non-nil if the language grammar for language exists and can be loaded.

If detail is non-nil, return (t . nil) when language is available, and (nil . data) when it’s unavailable. data is the signal data of treesit-load-language-error.
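
The following sketch shows one way to use the detail argument to report why a grammar cannot be loaded. The helper my-grammar-status is a hypothetical name, not part of Emacs, and the failure data is printed as is rather than taken apart, since its exact shape is not relied upon here.

```elisp
(defun my-grammar-status (language)
  "Return a string describing whether LANGUAGE's grammar can be loaded.
`my-grammar-status' is a hypothetical helper, not part of Emacs."
  (cond
   ;; Guard against an Emacs built without tree-sitter support.
   ((not (and (fboundp 'treesit-available-p) (treesit-available-p)))
    "tree-sitter support is not available in this Emacs")
   (t
    ;; With DETAIL non-nil the result is a cons: its car says whether
    ;; the grammar loaded, and its cdr carries the failure data.
    (let ((result (treesit-language-available-p language t)))
      (if (car result)
          (format "%s: grammar available" language)
        (format "%s: grammar unavailable, data: %S"
                language (cdr result)))))))

;; Example use: report the status of the C grammar in the echo area.
(message "%s" (my-grammar-status 'c))
```

Calling the function without detail is enough when only a yes-or-no answer is needed.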

By convention, the file name of the dynamic library for language is libtree-sitter-language.ext, where ext is the system-specific extension for dynamic libraries. Also by convention, the function provided by that library is named tree_sitter_language. If a language grammar library doesn’t follow this convention, you should add an entry

(language library-base-name function-name)

to the list in the variable treesit-load-name-override-list, where library-base-name is the basename of the dynamic library’s file name (usually, libtree-sitter-language), and function-name is the function provided by the library (usually, tree_sitter_language). For example,

(cool-lang "libtree-sitter-coool" "tree_sitter_cooool")

for a language that considers itself too “cool” to abide by conventions.
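
In an init file, such an entry could be added as sketched below, reusing the example entry above and assuming an Emacs built with tree-sitter support.

```elisp
;; cool-lang and both strings are taken from the example entry above.
;; Assumes Emacs 29 or later built with tree-sitter support.
(add-to-list 'treesit-load-name-override-list
             '(cool-lang "libtree-sitter-coool" "tree_sitter_cooool"))
```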

Function: treesit-library-abi-version &optional min-compatible

This function returns the version of the language grammar Application Binary Interface (ABI) supported by the tree-sitter library. By default, it returns the latest ABI version supported by the library, but if min-compatible is non-nil, it returns the oldest ABI version that the library can still support. Language grammar libraries must be built for an ABI version between the oldest and the latest versions supported by the tree-sitter library; otherwise the library will be unable to load them.

Function: treesit-language-abi-version language

This function returns the ABI version of the language grammar library loaded by Emacs for language. If language is unavailable, this function returns nil.
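
As a rough sketch, the two functions can be combined to collect the ABI versions that matter when diagnosing a grammar. The helper my-abi-report and its plist keys are hypothetical names chosen for illustration. A grammar that loads successfully should already fall within the supported range, since otherwise loading fails with version-mismatch, so the report is mainly useful for comparing a grammar build against the range the library accepts.

```elisp
(defun my-abi-report (language)
  "Return a plist of ABI versions relevant to LANGUAGE, or nil.
Return nil when this Emacs lacks tree-sitter support.
`my-abi-report' is a hypothetical helper, not part of Emacs."
  (when (and (fboundp 'treesit-available-p) (treesit-available-p))
    (list
     ;; ABI version of the loaded grammar; nil if LANGUAGE is unavailable.
     :grammar (treesit-language-abi-version language)
     ;; Oldest ABI version the tree-sitter library still supports.
     :oldest-supported (treesit-library-abi-version t)
     ;; Latest ABI version the tree-sitter library supports.
     :latest-supported (treesit-library-abi-version))))

;; Example use: show the report for the C grammar in the echo area.
(message "%S" (my-abi-report 'c))
```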

Concrete syntax tree

A syntax tree is what a parser generates. In a syntax tree, each node represents a piece of text, and nodes are connected to one another by parent-child relationships. For example, if the source text is

1 + 2

its syntax tree could be

                  +--------------+
                  | root "1 + 2" |
                  +--------------+
                         |
        +--------------------------------+
        |       expression "1 + 2"       |
        +--------------------------------+
           |             |            |
+------------+   +--------------+   +------------+
| number "1" |   | operator "+" |   | number "2" |
+------------+   +--------------+   +------------+

We can also represent it as an s-expression:

(root (expression (number) (operator) (number)))

Node types

Names like root, expression, number, and operator specify the type of the nodes. However, not all nodes in a syntax tree have a type. Nodes that don’t have a type are known as anonymous nodes, and nodes with a type are named nodes. Anonymous nodes are tokens with fixed spellings, including punctuation characters like bracket ‘]’, and keywords like return.

Field names

To make the syntax tree easier to analyze, many language grammars assign field names to child nodes. For example, a function_definition node could have a declarator and a body:

(function_definition
 declarator: (declaration)
 body: (compound_statement))
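
Field names can also be used from Lisp to pick out a particular child. The sketch below assumes that a C grammar is installed and that its function_definition node has a body field as in the example above; my-first-child-body-type is a hypothetical helper, and the node functions it calls are described in Retrieving Nodes.

```elisp
(defun my-first-child-body-type (c-source)
  "Return the type of the `body' field of the first top-level node.
C-SOURCE is a string of C code.  Return nil if tree-sitter or the C
grammar is unavailable, or if that node has no `body' field.
`my-first-child-body-type' is a hypothetical helper, not part of Emacs."
  (when (and (fboundp 'treesit-available-p)
             (treesit-available-p)
             (treesit-language-available-p 'c))
    (with-temp-buffer
      (insert c-source)
      (let* ((parser (treesit-parser-create 'c))
             (root (treesit-parser-root-node parser))
             ;; First child of the root, e.g. a function definition.
             (first-child (treesit-node-child root 0))
             ;; Look the child up by field name rather than by position.
             (body (treesit-node-child-by-field-name first-child "body")))
        (and body (treesit-node-type body))))))

;; Example use with a one-function C program.
(my-first-child-body-type "int main(void) { return 0; }")
```

Looking up a child by field name keeps the code independent of how many other children precede it.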

Exploring the syntax tree

To aid in understanding the syntax of a language and in debugging Lisp programs that use the syntax tree, Emacs provides an “explore” mode, which displays the syntax tree of the source in the current buffer in real time. Emacs also comes with an “inspect” mode, which displays information about the nodes at point in the mode line.

Command: treesit-explore-mode

This mode pops up a window displaying the syntax tree of the source in the current buffer. Selecting text in the source buffer highlights the corresponding nodes in the syntax tree display. Clicking on nodes in the syntax tree highlights the corresponding text in the source buffer.

Command: treesit-inspect-mode

This minor mode displays on the mode-line the node that starts at point. For example, the mode-line can display

parent field: (node (child (…)))

where node, child, etc., are nodes which begin at point. parent is the parent of node. node is displayed in a bold typeface. field-names are field names of node and of child, etc.

If no node starts at point, i.e., point is in the middle of a node, then the mode line displays the earliest node that spans point, and its immediate parent.

This minor mode doesn’t create parsers on its own. It uses the first parser in (treesit-parser-list) (see Using Tree-sitter Parser).

Reading the grammar definition

Authors of language grammars define the grammar of a programming language, which determines how a parser constructs a concrete syntax tree out of the program text. In order to use the syntax tree effectively, you need to consult the grammar file.

The grammar file is usually grammar.js in a language grammar’s project repository. The link to a language grammar’s home page can be found on tree-sitter’s homepage.

The grammar definition is written in JavaScript. For example, the rule matching a function_definition node may look like

function_definition: $ => seq(
  $.declaration_specifiers,
  field('declarator', $.declaration),
  field('body', $.compound_statement)
)

The rules are represented by functions that take a single argument $, representing the whole grammar. The function itself is constructed by other functions: the seq function puts together a sequence of children; the field function annotates a child with a field name. If we write the above definition in the so-called Backus-Naur Form (BNF) syntax, it would look like

function_definition :=
  <declaration_specifiers> <declaration> <compound_statement>

and the node returned by the parser would look like

(function_definition
  (declaration_specifiers)
  declarator: (declaration)
  body: (compound_statement))

Below is a list of functions that one can see in a grammar definition. Each function takes other rules as arguments and returns a new rule.

seq(rule1, rule2, …)

matches each rule one after another.

choice(rule1, rule2, …)

matches one of the rules in its arguments.

repeat(rule)

matches rule zero or more times. This is like the ‘*’ operator in regular expressions.

repeat1(rule)

matches rule one or more times. This is like the ‘+’ operator in regular expressions.

optional(rule)

matches rule zero or one times. This is like the ‘?’ operator in regular expressions.

field(name, rule)

assigns field name name to the child node matched by rule.

alias(rule, alias)

makes nodes matched by rule appear as alias in the syntax tree generated by the parser. For example,

alias($.preprocessor_call_exp, $.call_expression)

makes any node matched by preprocessor_call_exp appear as call_expression.

Below are grammar functions of lesser importance for reading a language grammar.

token(rule): marks rule to produce a single leaf node. That is, instead of generating a parent node with individual child nodes under it, everything is combined into a single leaf node. See Retrieving Nodes.
token.immediate(rule): Normally, grammar rules ignore preceding whitespace; this changes rule to match only when there is no preceding whitespace.
prec(n, rule): gives rule the level-n precedence.
prec.left([n,] rule): marks rule as left-associative, optionally with level n.
prec.right([n,] rule): marks rule as right-associative, optionally with level n.
prec.dynamic(n, rule): this is like prec, but the precedence is applied at runtime instead.

The documentation of the tree-sitter project has more about writing a grammar; see especially the section “The Grammar DSL”.