Semantic Tokens (Bison 3.8.1)

7.1 Semantic Info in Token Kinds

The C language has a context dependency: the way an identifier is used depends on what its current meaning is. For example, consider this:

foo (x);

This looks like a function call statement, but if foo is a typedef name, then this is actually a declaration of x. How can a Bison parser for C decide how to parse this input?

The method used in GNU C is to have two different token kinds, IDENTIFIER and TYPENAME. When yylex finds an identifier, it looks up the current declaration of the identifier in order to decide which token kind to return: TYPENAME if the identifier is declared as a typedef, IDENTIFIER otherwise.

The grammar rules can then express the context dependency by the choice of token kind to recognize. IDENTIFIER is accepted as an expression, but TYPENAME is not. TYPENAME can start a declaration, but IDENTIFIER cannot. In contexts where the meaning of the identifier is not significant, such as in declarations that can shadow a typedef name, either TYPENAME or IDENTIFIER is accepted—there is one rule for each of the two token kinds.

This technique is simple to use if the decision of which kinds of identifiers to allow is made at a place close to where the identifier is parsed. But in C this is not always so: C allows a declaration to redeclare a typedef name provided an explicit type has been specified earlier:

typedef int foo, bar;
int baz (void)

{
  static bar (bar);      /* redeclare bar as static variable */
  extern foo foo (foo);  /* redeclare foo as function */
  return foo (bar);
}

Unfortunately, the name being declared is separated from the declaration construct itself by a complicated syntactic structure—the “declarator”.

As a result, part of the Bison parser for C needs to be duplicated, with all the nonterminal names changed: once for parsing a declaration in which a typedef name can be redefined, and once for parsing a declaration in which that can’t be done. Here is a part of the duplication, with actions omitted for brevity:

initdcl:
  declarator maybeasm '=' init
| declarator maybeasm
;

notype_initdcl:
  notype_declarator maybeasm '=' init
| notype_declarator maybeasm
;

Here initdcl can redeclare a typedef name, but notype_initdcl cannot. The distinction between declarator and notype_declarator is the same sort of thing.

There is some similarity between this technique and a lexical tie-in (described next), in that information which alters the lexical analysis is changed during parsing by other parts of the program. The difference is here the information is global, and is used for other purposes in the program. A true lexical tie-in has a special-purpose flag controlled by the syntactic context.