4. Syntax of input files

Treecc input files consist of zero or more declarations that define nodes, operations, options, etc. The following sections describe each of these elements.

4.1 Node declarations

Node types are defined using the ‘%node’ keyword in input files. The general form of the declaration is:

%node NAME [ PNAME ] [ FLAGS ] [ = FIELDS ]

‘NAME’

An identifier that is used to refer to the node type elsewhere in the treecc definition. It is also the name of the type that will be visible to the programmer in literal code blocks.

‘PNAME’

An identifier that refers to the parent node type that ‘NAME’ inherits from. If ‘PNAME’ is not supplied, then ‘NAME’ is a top-level declaration. It is legal to supply a ‘PNAME’ that has not yet been defined in the input.

‘FLAGS’

Any combination of ‘%abstract’ and ‘%typedef’:

‘%abstract’: The node type cannot be constructed by the programmer. In addition, the programmer does not need to define operation cases for this node type if all subtypes have cases associated with them.
‘%typedef’: The node type is used as the common return type for node creation functions. Top-level declarations must have a ‘%typedef’ keyword.

The ‘FIELDS’ part of a node declaration defines the fields that make up the node type. Each field has the following general form:

[ %nocreate ] TYPE FNAME [ = VALUE ] ';'

‘%nocreate’

The field is not used in the node's constructor. When the node is constructed, the value of this field will be undefined unless ‘VALUE’ is specified.

‘TYPE’

The type that is associated with the field. Types can be declared using a subset of the C declaration syntax, augmented with some C++ and Java features. See section 4.2, Types used in fields and parameters, for more information.

‘FNAME’

The name to associate with the field. Treecc verifies that the field does not currently exist in this node type, or in any of its ancestor node types.

‘VALUE’

The default value to assign to the field in the node's constructor. This can only be used on fields that are declared with ‘%nocreate’. The value must be enclosed in braces. For example, ‘{NULL}’ would be used to initialize a field with ‘NULL’.

The braces are required because the default value is expressed in the underlying source language, and can use any of the usual constant declaration features present in that language.

When the output language is C, treecc creates a struct-based type called ‘NAME’ that contains the fields for ‘NAME’ and all of its ancestor classes. The type also contains some house-keeping fields that are used internally by the generated code. The following is an example:

typedef struct binary__ binary;
struct binary__ {
    const struct binary_vtable__ *vtable__;
    int kind__;
    char *filename__;
    long linenum__;
    type_code type;
    expression * expr1;
    expression * expr2;
};

The programmer should avoid using any identifier that ends with ‘__’, because it may clash with house-keeping identifiers that are generated by treecc.
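
As a rough illustration of how the ‘%nocreate’ and ‘VALUE’ rules fit together with a struct-based layout like the one above, the following hand-written sketch builds a node whose ‘type’ field is set from a default rather than from a constructor argument. It is not treecc output: the names ‘binary_sketch’, ‘binary_sketch_create’, ‘expression_sketch’, and ‘type_unknown’ are hypothetical, and the house-keeping fields are reduced to a single kind value.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for types that a real input file would declare. */
typedef enum { type_unknown, type_int, type_float } type_code;
typedef struct expression_sketch expression_sketch;

/* Simplified layout: one house-keeping field, then the inherited field,
   then the fields declared by the node type itself. */
typedef struct binary_sketch binary_sketch;
struct binary_sketch {
    int kind;                  /* house-keeping: identifies the node type */
    type_code type;            /* inherited; assumed to be %nocreate with a default */
    expression_sketch *expr1;  /* ordinary field: supplied by the caller */
    expression_sketch *expr2;  /* ordinary field: supplied by the caller */
};

#define BINARY_SKETCH_KIND 1   /* arbitrary value chosen for this sketch */

/* Only fields without %nocreate appear as parameters; the %nocreate
   field receives its default value instead. Returns NULL on failure. */
binary_sketch *binary_sketch_create(expression_sketch *expr1,
                                    expression_sketch *expr2)
{
    binary_sketch *node = malloc(sizeof *node);
    if (node == NULL) {
        return NULL;
    }
    node->kind = BINARY_SKETCH_KIND;
    node->type = type_unknown;   /* plays the role of a default such as {type_unknown} */
    node->expr1 = expr1;
    node->expr2 = expr2;
    return node;
}
```

Without a default value, the assignment to ‘type’ would simply be absent, which is why the field's contents are undefined in that case.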

When the output language is C++, Java, or C#, treecc creates a class called ‘NAME’ that inherits from the class ‘PNAME’. The field definitions for ‘NAME’ are converted into public members in the output.

4.2 Types used in fields and parameters

Types that are used in field and parameter declarations have a syntax that is a subset of the features found in C, C++, and Java:

TypeAndName ::= Type [ IDENTIFIER ]

Type ::= TypeName
       | Type '*'
       | Type '&'
       | Type '[' ']'

TypeName ::= IDENTIFIER { IDENTIFIER }

Types are usually followed by an identifier that names the field or parameter. The name is required for fields and is optional for parameters. For example, ‘int’ is usually equivalent to ‘int x’ in parameter declarations.

The following are some examples of using types:

int
int x
const char *str
expression *expr
Element[][] array
Item&
unsigned int y
const Element

The grammar used by treecc is slightly ambiguous. The last example above declares a parameter called ‘Element’, that has type ‘const’. The programmer probably intended to declare an anonymous parameter with type ‘const Element’ instead.

This ambiguity is unavoidable given that treecc is not fully aware of the underlying language's type system. When treecc sees a type that ends in a sequence of identifiers, it will always interpret the last identifier as the field or parameter name. Thus, the programmer must write the following instead:

const Element e
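
The "last identifier is the name" rule can be sketched in C as follows. This is an illustrative re-implementation, not treecc's parser: the function name ‘split_type_and_name’ and its buffer-based interface are assumptions made for the example. A declaration that consists of a single identifier, or that does not end in an identifier, is treated as an anonymous type.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns nonzero if c may appear in an identifier. */
static int is_ident_char(int c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Copies the range [begin, end) into out as a string, truncating to cap. */
static void copy_range(char *out, size_t cap, const char *begin, const char *end)
{
    size_t len = (size_t)(end - begin);
    if (cap == 0) {
        return;
    }
    if (len >= cap) {
        len = cap - 1;
    }
    memcpy(out, begin, len);
    out[len] = '\0';
}

/* Splits decl into a type and a name. Returns 1 if a name was found,
   or 0 if the declaration is anonymous (name is then set to ""). */
int split_type_and_name(const char *decl,
                        char *type, size_t type_cap,
                        char *name, size_t name_cap)
{
    const char *begin = decl;
    const char *end = decl + strlen(decl);
    const char *id;
    const char *type_end;

    /* Trim surrounding whitespace. */
    while (begin < end && isspace((unsigned char)*begin)) begin++;
    while (end > begin && isspace((unsigned char)end[-1])) end--;

    /* Walk back over the trailing identifier, if there is one. */
    id = end;
    while (id > begin && is_ident_char(id[-1])) id--;

    /* Drop the whitespace that separates the type from that identifier. */
    type_end = id;
    while (type_end > begin && isspace((unsigned char)type_end[-1])) type_end--;

    if (id == end || type_end == begin) {
        /* No trailing identifier, or nothing but one identifier. */
        copy_range(type, type_cap, begin, end);
        copy_range(name, name_cap, end, end);
        return 0;
    }
    copy_range(type, type_cap, begin, type_end);
    copy_range(name, name_cap, id, end);
    return 1;
}
```

Under these assumptions, ‘const Element’ splits into the type ‘const’ and the name ‘Element’, which is exactly the surprise described above, while ‘const Element e’ yields the intended type.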

Treecc cannot declare types using the full power of C's type system. The most common forms of declarations are supported, and the rest can usually be obtained by defining a ‘typedef’ within a literal code block. See section 4.6, Literal code declarations, for more information on literal code blocks.

It is the responsibility of the programmer to use type constructs that are supported by the underlying programming language. Types such as ‘const char *’ will give an error when the output is compiled with a Java compiler, for example.

4.3 Enumerated type declarations

Enumerated types are a special kind of node type that can be used by the programmer for simple values that don't require a full abstract syntax tree node. The following is an example of defining a list of the primitive machine types used in a Java virtual machine:

%enum JavaType =
{
    JT_BYTE,
    JT_SHORT,
    JT_CHAR,
    JT_INT,
    JT_LONG,
    JT_FLOAT,
    JT_DOUBLE,
    JT_OBJECT_REF
}

Enumerations are useful when writing code generators and type inference routines. The general form is:

%enum NAME = { VALUES }

‘NAME’: An identifier to be used to name the enumerated type. The name must not have been previously used as a node type, an enumerated type, or an enumerated value.
‘VALUES’: A comma-separated list of identifiers that name the values within the enumeration. Each of the names must be unique, and must not have been used previously as a node type, an enumerated type, or an enumerated value.

Logically, each enumerated value is a special node type that inherits from a parent node type corresponding to the enumerated type ‘NAME’.

When the output language is C or C++, treecc generates an enumerated typedef for ‘NAME’ that contains the enumerated values in the same order as was used in the input file. The typedef name can be used elsewhere in the code as the type of the enumeration.
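
For C output, the result can be pictured roughly as follows. The typedef is hand-written for illustration, and the exact text and numeric values that treecc emits may differ. The helper ‘java_type_slots’ is hypothetical and only shows the typedef name being used as an ordinary C type.

```c
/* Hand-written approximation of an enumerated typedef for JavaType.
   The values are listed in the same order as in the input file. */
typedef enum {
    JT_BYTE,
    JT_SHORT,
    JT_CHAR,
    JT_INT,
    JT_LONG,
    JT_FLOAT,
    JT_DOUBLE,
    JT_OBJECT_REF
} JavaType;

/* Hypothetical helper: the number of JVM local-variable slots that a
   value of the given type occupies. Long and double values take two
   slots; every other type takes one. */
int java_type_slots(JavaType type)
{
    switch (type) {
    case JT_LONG:
    case JT_DOUBLE:
        return 2;
    default:
        return 1;
    }
}
```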

When the output language is Java, treecc generates a class called ‘NAME’ that contains the enumerated values as integer constants. Elsewhere in the code, the type ‘int’ must be used to declare variables of the enumerated type. Enumerated values are referred to as ‘NAME.VALUE’. If the enumerated type is used as a trigger parameter, then ‘NAME’ must be used instead of ‘int’: treecc will convert the type when the Java code is output.

When the output language is C#, treecc generates an enumerated value type called ‘NAME’ that contains the enumerated values as members. The C# type ‘NAME’ can be used elsewhere in the code as the type of the enumeration. Enumerated values are referred to as ‘NAME.VALUE’.

4.4 Operation declarations

Operations are declared in two parts: the declaration, and the cases. The declaration part defines the prototype for the operation and the cases define how to handle specific kinds of nodes for the operation.

Operations are defined over one or more trigger parameters. Each trigger parameter specifies a node type or an enumerated type that is selected upon to determine what course of action to take. The following are some examples of operation declarations:

%operation void infer_type(expression *e)
%operation type_code common_type([type_code t1], [type_code t2])

Trigger parameters are specified by enclosing them in square brackets. If none of the parameters are enclosed in square brackets, then treecc assumes that the first parameter is the trigger.

The general form of an operation declaration is as follows:

%operation { %virtual | %inline | %split } RTYPE [CLASS::]NAME(PARAMS)

‘%virtual’

Specifies that the operation is associated with a node type as a virtual method. There must be only one trigger parameter, and it must be the first parameter.

Non-virtual operations are written to the output source files as global functions.

‘%inline’

Optimise the generation of the operation code so that all cases are inline within the code for the function itself. This can only be used with non-virtual operations, and may improve code efficiency if there are lots of operation cases with a small amount of code in each.

‘%split’

Split the generation of the multi-trigger operation code across multiple functions, to reduce the size of each individual function. It is sometimes necessary to split large %inline operations to avoid compiler limits on function size.

‘RTYPE’

The type of the return value for the operation. This should be ‘void’ if the operation does not have a return value.

‘CLASS’

The name of the class to place the operation's definition within. This can only be used with non-virtual operations, and is intended for languages such as Java and C# that cannot declare methods outside of classes. The class name will be ignored if the output language is C.

If a class name is required, but the programmer did not supply it, then ‘NAME’ will be used as the default. The exception to this is the C# language: ‘CLASS’ must always be supplied and it must be different from ‘NAME’. This is due to a "feature" in some C# compilers that forbids a method with the same name as its enclosing class.

‘NAME’

The name of the operation.

‘PARAMS’

The parameters to the operation. Trigger parameters may be enclosed in square brackets. Trigger parameters must be either node types or enumerated types.

Once an operation has been declared, the programmer can specify its cases anywhere in the input files. It is not necessary that the cases appear after the operation, or that they be contiguous within the input files. This permits the programmer to place operation cases where they are logically required for maintenance reasons.

There must be sufficient operation cases defined to cover every possible combination of node types and enumerated values that inherit from the specified trigger types. An operation case has the following general form:

NAME(TRIGGERS) [, NAME(TRIGGERS2) ...]
{
    CODE
}

‘NAME’: The name of the operation for which this case applies.
‘TRIGGERS’: A comma-separated list of node types or enumerated values that define the specific case that is handled by the following code.
‘CODE’: Source code in the output source language that implements the operation case.

Multiple trigger combinations can be associated with a single block of code, by listing them all, separated by commas. For example:

common_type(int_type, int_type)
{
    return int_type;
}

common_type(int_type, float_type),
common_type(float_type, int_type),
common_type(float_type, float_type)
{
    return float_type;
}
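
To make the coverage requirement concrete, here is a hand-written C sketch of the kind of dispatch that the four ‘common_type’ cases above describe. It is not the code that treecc generates. The two-value ‘type_code’ enumeration is an assumption made so that the example is self-contained.

```c
#include <stdlib.h>

/* Assumed enumerated type with just the two values used in the cases. */
typedef enum { int_type, float_type } type_code;

/* Selects on both trigger parameters. With two values per trigger there
   are four combinations, and each one is handled exactly once. */
type_code common_type(type_code t1, type_code t2)
{
    switch (t1) {
    case int_type:
        switch (t2) {
        case int_type:
            return int_type;     /* common_type(int_type, int_type) */
        case float_type:
            return float_type;   /* common_type(int_type, float_type) */
        }
        break;
    case float_type:
        switch (t2) {
        case int_type:           /* common_type(float_type, int_type) */
        case float_type:         /* common_type(float_type, float_type) */
            return float_type;
        }
        break;
    }
    abort();  /* reached only if a value outside the enumeration is passed */
}
```

If one of the four combinations were missing from the input file, there would be no code to run for it, which is why every combination must be covered.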

4.5 Options that modify treecc's behaviour

"(*)" is used below to indicate an option that is enabled by default.

‘%option track_lines’

Enable the generation of code that can track the current filename and line number when nodes are created. See section Tracking line numbers in source files, for more information. (*)

‘%option no_track_lines’

Disable the generation of code that performs line number tracking.

‘%option singletons’

Optimise the creation of singleton node types. These are node types without any fields. Treecc can optimise the code so that only one instance of a singleton node type exists in the system. This can speed up the creation of nodes for constants within compilers. (*)

Singleton optimisations will have no effect if ‘track_lines’ is enabled, because line tracking uses special hidden fields in every node.
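
The idea behind the singleton optimisation can be sketched in C as below. This is an assumption about the general technique rather than treecc's generated code, and the names ‘node_sketch’, ‘NULL_CONST_KIND’, and ‘null_const_create_sketch’ are hypothetical.

```c
/* Minimal node representation for this sketch: no fields beyond the kind. */
typedef struct node_sketch {
    int kind;
} node_sketch;

#define NULL_CONST_KIND 1   /* arbitrary value chosen for this sketch */

/* Every call returns the same statically allocated instance, so
   "creating" the node never allocates memory. */
node_sketch *null_const_create_sketch(void)
{
    static node_sketch instance = { NULL_CONST_KIND };
    return &instance;
}
```

Sharing one instance only works when the node carries no per-instance data, which is consistent with the optimisation having no effect once each node stores its own filename and line number.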

‘%option no_singletons’

Disable the optimisation of singleton node types.

‘%option reentrant’

Enable the generation of reentrant code that does not rely upon any global variables. Separate copies of the compiler state can be used safely in separate threads. However, the same copy of the compiler state cannot be used safely in two or more threads.

‘%option no_reentrant’

Disable the generation of reentrant code. The interface to node management functions is simpler, but cannot be used in a threaded environment. (*)

‘%option force’

Force output source files to be written, even if they are unchanged. This option can also be set using the ‘-f’ command-line option.

‘%option no_force’

Don't force output source files to be written if they are the same as before. (*)

This option can help smooth integration of treecc with make. Only those output files that have changed will be modified. This reduces the number of files that the underlying source language compiler must process after treecc is executed.
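
The effect of ‘no_force’ can be approximated by comparing the new contents against the existing file before writing. The sketch below is an assumption about how such a check might look, not treecc's implementation; ‘write_if_changed’ and ‘file_has_contents’ are hypothetical names.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the file at path exists and holds exactly len bytes equal
   to data, and 0 otherwise (including when the file cannot be read). */
static int file_has_contents(const char *path, const char *data, size_t len)
{
    char buf[4096];
    size_t offset = 0;
    size_t n;
    int same = 1;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL) {
        return 0;
    }
    while (same && (n = fread(buf, 1, sizeof buf, fp)) > 0) {
        if (n > len - offset || memcmp(buf, data + offset, n) != 0) {
            same = 0;            /* file is longer than data, or differs */
        } else {
            offset += n;
        }
    }
    if (ferror(fp) || offset != len) {
        same = 0;                /* read error, or file is shorter than data */
    }
    fclose(fp);
    return same;
}

/* Writes data to path unless the file already has those contents.
   A nonzero force skips the comparison. Returns 1 if the file was
   written, 0 if it was left untouched, and -1 on error. */
int write_if_changed(const char *path, const char *data, size_t len, int force)
{
    FILE *fp;
    int ok;

    if (!force && file_has_contents(path, data, len)) {
        return 0;
    }
    fp = fopen(path, "wb");
    if (fp == NULL) {
        return -1;
    }
    ok = (fwrite(data, 1, len, fp) == len);
    if (fclose(fp) != 0) {
        ok = 0;
    }
    return ok ? 1 : -1;
}
```

A file that is left untouched keeps its modification time, so make does not consider the files that depend on it to be out of date.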

‘%option virtual_factory’

Use virtual methods in the node type factories, so that the programmer can subclass the factory and provide new implementations of node creation functions. This option is ignored for C, which does not use factories.

‘%option no_virtual_factory’

Don't use virtual methods in the node type factories. (*)

‘%option abstract_factory’

Use abstract virtual methods in the node type factories. The programmer is responsible for subclassing the factory to provide node creation functionality.

‘%option no_abstract_factory’

Don't use abstract virtual methods in the node type factories. (*)

‘%option kind_in_node’

Put the kind field in the node, for more efficient access at runtime. (*)

‘%option kind_in_vtable’

Put the kind field in the vtable, and not the node. This saves some memory, at the cost of slower access to the kind value at runtime. This option only applies when the language is C. The kind field is always placed in the node in other languages, because it isn't possible to modify the vtable.

‘%option prefix = PREFIX’

Specify the prefix to be used in output files in place of "yy".

‘%option state_type = NAME’

Specify the name of the state type. The state type is generated by treecc to perform centralised memory management and reentrancy support. The default value is ‘YYNODESTATE’. If the output language uses factories, then this will also be the name of the factory base class.

‘%option namespace = NAME’

Specify the namespace to write definitions to in the output source files. This option is ignored when the output language is C.

‘%option package = NAME’

Same as ‘%option namespace = NAME’. Provided because ‘package’ is more natural for Java programmers.

‘%option base = NUM’

Specify the numeric base to use for allocating numeric values to node types. By default, node type allocation begins at 1.

‘%option lang = LANGUAGE’

Specify the output language. Must be one of "C", "C++", "Java", "C#", "Ruby", "PHP", or "Python". The default is "C".

‘%option block_size = NUM’

Specify the size of the memory blocks to use in C and C++ node allocators.

‘%option strip_filenames’

Strip filenames down to their base name in #line directives, i.e. strip off the directory component. This can be helpful in combination with the ‘%include %readonly’ command when treecc input files may be processed from different directories, causing common output files to change unexpectedly.
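
Stripping the directory component amounts to keeping the text after the last directory separator. The following C sketch shows one way to do that. The function name ‘strip_directory’ is hypothetical, and treating both ‘/’ and ‘\’ as separators is an assumption of the sketch rather than documented treecc behaviour.

```c
#include <string.h>

/* Returns a pointer to the part of path that follows the last directory
   separator, or path itself when it contains no separator. */
const char *strip_directory(const char *path)
{
    const char *slash = strrchr(path, '/');
    const char *backslash = strrchr(path, '\\');
    const char *last = slash;

    /* Use whichever separator occurs later in the string. */
    if (backslash != NULL && (last == NULL || backslash > last)) {
        last = backslash;
    }
    return (last != NULL) ? last + 1 : path;
}
```

Because the result no longer depends on the directory from which the input file was reached, the same input produces the same filename text in the output wherever it is processed.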

‘%option no_strip_filenames’

Don't strip filenames in #line directives. (*)

‘%option internal_access’

Use internal as the access mode for classes in C#, rather than public.

‘%option public_access’

Use public as the access mode for classes in C#, rather than internal. (*)

‘%option print_lines’

Print #line markers in languages that use them. (*)

‘%option no_print_lines’

Do not print #line markers, even in languages that normally use them.

‘%option allocator’

Use treecc's standard node allocator for C and C++. This option has no effect for other output languages. (*)

‘%option no_allocator’

Do not use treecc's standard node allocator for C and C++. This can be useful when the programmer wants to redirect node allocation to their own routines.

‘%option gc_allocator’

Use libgc as a garbage-collecting node allocator for C and C++. This option has no effect for other output languages.

‘%option no_gc_allocator’

Do not use libgc as a garbage-collecting node allocator for C and C++. (*)

‘%option base_type’

Specify the base type for the root node of the treecc node hierarchy. The default is no base type.

4.6 Literal code declarations

Sometimes it is necessary to embed literal code within output ‘.h’ and source files. Usually this is to ‘#include’ definitions from other files, or to define functions that cannot be easily expressed as operations.

A literal code block is specified by enclosing it in ‘%{’ and ‘%}’. The block can also be prefixed with the following flags:

‘%decls’: Write the literal code to the currently active declaration header file, instead of the source file.
‘%both’: Write the literal code to both the currently active declaration header file and the currently active source file.
‘%end’: Write the literal code to the end of the file, instead of the beginning.

Another form of literal code block is one which begins with ‘%%’ and extends to the end of the current input file. This form implicitly has the ‘%end’ flag.

4.7 Changing input and output files

Most treecc compiler definitions will be too large to be manageable in a single input file. They also will be too large to write to a single output file, because that may overload the source language compiler.

Multiple input files can be specified on the command-line, or they can be explicitly included by other input files with the following declarations:

‘%include [ %readonly ] FILENAME’

Include the contents of the specified file at the current point within the current input file. ‘FILENAME’ is interpreted relative to the name of the current input file.

If the ‘%readonly’ keyword is supplied, then any output files that are generated by the included file must be read-only. That is, no changes are expected by performing the inclusion.

The ‘%readonly’ keyword is useful for building compilers in layers. The programmer may group a large number of useful node types and operations together that are independent of the particulars of a given language. The programmer then defines language-specific compilers that "inherit" the common definitions.

Read-only inclusions ensure that any extensions that are added by the language-specific parts do not "leak" into the common code.

Output files can be changed using the following declarations:

‘%header FILENAME’

Change the currently active declaration header file to ‘FILENAME’, which is interpreted relative to the current input file. This option has no effect for languages without header files (Java and C#).

Any node types and operations that are defined after a ‘%header’ declaration will be declared in ‘FILENAME’.

‘%output FILENAME’

Change the currently active source file to ‘FILENAME’, which is interpreted relative to the current input file. This option has no effect for languages that require a single class per file (Java).

Any node types and operations that are defined after a ‘%output’ declaration will have their implementations placed in ‘FILENAME’.

‘%outdir DIRNAME’

Change the output source directory to ‘DIRNAME’. This is only used for Java, which requires that a single file be used for each class. All classes are written to the specified directory. By default, ‘DIRNAME’ is the current directory where treecc was invoked.

When treecc generates the output source code, it must insert several common house-keeping functions and classes into the code. By default, these are written to the first header and source files. This can be changed with the ‘%common’ declaration:

‘%common’

Output the common house-keeping code to the currently active declaration header file and the currently active source file. This is typically used as follows:

%header "common.h"
%output "common.c"
%common

This document was generated by Klaus Treichel on January, 18 2009 using texi2html 1.78.