GNU Source-highlight 2.9


GNU Source-highlight

GNU Source-highlight, given a source file, produces a document with syntax highlighting.

This is Edition 2.9 of the Source-highlight manual.

This file documents GNU Source-highlight version 2.9.

This manual is for GNU Source-highlight (version 2.9, 26 February 2008), which, given a source file, produces a document with syntax highlighting.

Copyright © 2005-2007 Lorenzo Bettini.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, with the Front-Cover Texts being “A GNU Manual,” and with the Back-Cover Texts as in (a) below. A copy of the license is included in the section entitled “GNU Free Documentation License.”

(a) The FSF's Back-Cover Text is: “You have freedom to copy and modify this GNU Manual, like GNU software. Copies published by the Free Software Foundation raise funds for GNU development.”


1 Introduction

GNU Source-highlight, given a source file, produces a document with syntax highlighting. The colors and the styles (bold, italics, underline) can be specified in a configuration file, and some other options can be given on the command line.

The program already recognizes many programming languages (e.g., C++, Java, Perl, etc.) and file formats (e.g., log files, ChangeLog, etc.), and some output formats (e.g., HTML, ANSI color escape sequences, LaTeX, etc.). Since version 2.0, it allows you to specify your own input source language via a simple syntax described later in this manual (Language Definitions). Since version 2.1, it allows you to specify your own output format language via a simple syntax described later in this manual (Output Language Definitions). Since version 2.2, it is able to generate cross references (e.g., to variable names, field names, etc.) by relying on the program ctags, http://ctags.sourceforge.net (Generating References).


1.1 Supported languages

The complete list of languages (indeed, file extensions) natively supported by this version of Source-highlight (2.9), as reported by --lang-list, is the following:

     Supported languages (file extensions)
     and associated language definition files
     
     C = cpp.lang
     H = cpp.lang
     am = makefile.lang
     bib = bib.lang
     bison = bison.lang
     c = c.lang
     caml = caml.lang
     cc = cpp.lang
     changelog = changelog.lang
     cls = latex.lang
     cpp = cpp.lang
     cs = csharp.lang
     csh = sh.lang
     csharp = csharp.lang
     css = css.lang
     desktop = desktop.lang
     diff = diff.lang
     docbook = xml.lang
     dtx = latex.lang
     eps = postscript.lang
     flex = flex.lang
     fortran = fortran.lang
     h = cpp.lang
     haxe = haxe.lang
     hh = cpp.lang
     hpp = cpp.lang
     htm = html.lang
     html = html.lang
     hx = haxe.lang
     in = makefile.lang
     ini = desktop.lang
     java = java.lang
     javascript = javascript.lang
     js = javascript.lang
     kcfg = xml.lang
     kdevelop = xml.lang
     kidl = xml.lang
     ksh = sh.lang
     l = flex.lang
     lang = langdef.lang
     langdef = langdef.lang
     latex = latex.lang
     lex = flex.lang
     lgt = logtalk.lang
     ll = flex.lang
     log = log.lang
     logtalk = logtalk.lang
     lsm = lsm.lang
     lua = lua.lang
     m4 = m4.lang
     makefile = makefile.lang
     ml = caml.lang
     mli = caml.lang
     moc = cpp.lang
     outlang = outlang.lang
     pas = pascal.lang
     pascal = pascal.lang
     patch = diff.lang
     perl = perl.lang
     php = php.lang
     php3 = php.lang
     php4 = php.lang
     php5 = php.lang
     pl = prolog.lang
     pm = perl.lang
     postscript = postscript.lang
     prolog = prolog.lang
     properties = properties.lang
     ps = postscript.lang
     py = python.lang
     python = python.lang
     rb = ruby.lang
     rc = xml.lang
     ruby = ruby.lang
     sh = sh.lang
     shell = sh.lang
     sig = sml.lang
     sl = slang.lang
     slang = slang.lang
     slsh = slang.lang
     sml = sml.lang
     spec = spec.lang
     sql = sql.lang
     sty = latex.lang
     style = style.lang
     syslog = log.lang
     tcl = tcl.lang
     tcsh = sh.lang
     tex = latex.lang
     tk = tcl.lang
     txt = nohilite.lang
     ui = xml.lang
     xhtml = xml.lang
     xml = xml.lang
     y = bison.lang
     yacc = bison.lang
     yy = bison.lang

The complete list of output formats natively supported by this version of Source-highlight (2.9), as reported by --outlang-list, is the following:

     Supported output languages
     and associated language definition files
     
     docbook = docbook.outlang
     docbook-doc = docbookdoc.outlang
     esc = esc.outlang
     esc-doc = esc.outlang
     html = html.outlang
     html-css = css_common.outlang
     html-css-doc = htmlcss.outlang
     html-doc = htmldoc.outlang
     htmltable = htmltable.outlang
     javadoc = javadoc.outlang
     latex = latex.outlang
     latex-doc = latexdoc.outlang
     latexcolor = latexcolor.outlang
     latexcolor-doc = latexcolordoc.outlang
     texinfo = texinfo.outlang
     xhtml = xhtml.outlang
     xhtml-css = css_common.outlang
     xhtml-css-doc = xhtmlcss.outlang
     xhtml-doc = xhtmldoc.outlang
     xhtmltable = xhtmltable.outlang

The meaning of the suffixes -doc, -css and -css-doc is explained in Output Language map.

Please keep in mind that I haven't personally tested all these language definitions. I checked that the definition files are syntactically correct (with the command line option --check-lang, see Invoking source-highlight), but I'm not sure that every definition actually respects the syntax of its language (e.g., I put together some language definitions by searching for information on the Internet, without ever having programmed in that language). So, if you find that a language definition is not precise, please let me know. Moreover, if you have a program example in a language that's not included in the tests directory, please send it to me so that I can include it in the test suite.
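The list printed by --lang-list can also be queried from a script. The following is only a sketch: the helper name lang_for_ext is made up for this example, and it assumes the output keeps the "extension = file.lang" layout shown above.

```shell
# lang_for_ext (hypothetical helper, not part of source-highlight):
# reads "extension = file.lang" lines on standard input and prints the
# language definition file associated with the extension given as $1.
# Returns a non-zero status when the extension is not listed.
lang_for_ext() {
    awk -v ext="$1" '
        $2 == "=" && $1 == ext { print $3; found = 1 }
        END { exit !found }
    '
}

# Usage, when source-highlight is installed:
#   source-highlight --lang-list | lang_for_ext hpp
```

The comparison is case sensitive, which matters here because C and c are associated with different definition files.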


1.2 Using source-highlight as a simple formatter

You can also use source-highlight as a simple formatter of input files, i.e., without performing any highlighting [1].

You can achieve this by passing the file nohilite.lang as the language definition file for input sources, using the command line option --lang-def (Invoking source-highlight). Since that language definition is empty, no highlighting will be performed; however, source-highlight will transform the input file into the output format. Notice, in the input language associations in Supported languages, that nohilite.lang is also associated with txt files.

This, for instance, makes source-highlight useful when you want to transform a text file into HTML or LaTeX. During the output, in fact, source-highlight will correctly escape characters that have a special meaning in the output format.

For instance, in this Texinfo manual, if I want to insert a @ or a { I have to “escape” them to make them appear literally since they have a special meaning in Texinfo. The same holds, e.g., for <, > or & in HTML. If you use source-highlight, it will take care of this, automatically for you. This is the Texinfo source of the above sentence:

     For instance, in this Texinfo manual,
     if I want to insert a @@ or a @{
     I have to ``escape'' them to make them appear literally
     since they have a special meaning in Texinfo.
     The same holds, e.g.,
     for @code{<}, @code{>} or @code{&} in HTML.
     If you use source-highlight,
     it will take care of this, automatically for you.

This was processed by source-highlight as a simple text file, without any highlighting; however, since it was formatted in Texinfo, all the necessary escaping was automatically performed. This way, it is very easy to insert, in the same document, a piece of code and its result (as in this example).

This is actually the formatting performed by source-highlight; except for the comment, this is basically what you should have written yourself to do all the escaping stuff manually:

     @c Generator: GNU source-highlight, by Lorenzo Bettini, http://www.gnu.org/software/src-highlite
     @example
     For instance, in this Texinfo manual,
     if I want to insert a @@@@ or a @@@{
     I have to ``escape'' them to make them appear literally
     since they have a special meaning in Texinfo.
     The same holds, e.g.,
     for @@code@{<@}, @@code@{>@} or @@code@{&@} in HTML.
     If you use source-highlight,
     it will take care of this, automatically for you.
     @end example

If source-highlight does not handle a specific input language, you can still use the option --failsafe (Invoking source-highlight); in that case too, no highlighting will be performed, but source-highlight will transform the input file into the output format.

Notice, however, that if the input language cannot be established, the default.lang will be used: an empty language definition file which you might want to customize.
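As an illustration of the formatter-only use described in this section, the command below forces nohilite.lang and asks for Texinfo output. It is a sketch: sample.txt and sample.texinfo are file names invented for the example, and the command is only run when source-highlight is found in the PATH.

```shell
# Create a small text file containing characters that are special in Texinfo.
# (sample.txt and sample.texinfo are made-up names for this example.)
printf '%s\n' 'insert a @ or a { here' > sample.txt

# Format it without any highlighting: nohilite.lang is the empty language
# definition, so only the escaping for the output format is performed.
if command -v source-highlight >/dev/null 2>&1; then
    source-highlight --lang-def nohilite.lang --out-format texinfo \
        --input sample.txt --output sample.texinfo \
        || echo "source-highlight failed" >&2
fi
```

Since txt files are already associated with nohilite.lang, the --lang-def option is redundant for this particular input; it is spelled out to show how to force the behavior for other extensions.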


1.3 Related Software and Links

Here we list some software related to source-highlight in the sense that it uses it as a backend (i.e., provides an interface to source-highlight) or it uses some of its features (e.g., definition files):


2 Installation

See the file INSTALL for detailed building and installation instructions; anyway if you're used to compiling Linux software that comes with sources you may simply follow the usual procedure, i.e., untar the file you downloaded in a directory and then:

     cd <source code main directory>
     ./configure
     make
     make install

However, before you do this, please check that you have everything that is needed to build source-highlight (see What you need to build source-highlight).

Note: unless you specify a different install directory by --prefix option of configure (e.g. ./configure --prefix=<your home>), you must be root to run make install.

Files will be installed in the following directories:

Executables
     /prefix/bin
Docs and samples
     /prefix/share/doc/source-highlight
Conf files
     /prefix/share/source-highlight

The default value for prefix is /usr/local, but you may change it with the --prefix option of configure.
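For example, an installation that does not require root could be sketched as follows. The prefix $HOME/local is an arbitrary choice made for this example; the three directory variables simply spell out the layout listed above.

```shell
# Install prefix; $HOME/local is only an example value.
prefix="$HOME/local"

# Where the files end up, following the layout described above.
bindir="$prefix/bin"
docdir="$prefix/share/doc/source-highlight"
datadir="$prefix/share/source-highlight"
echo "executables:      $bindir"
echo "docs and samples: $docdir"
echo "conf files:       $datadir"

# Configure, build and install, when run from the source code main directory.
if [ -x ./configure ]; then
    ./configure --prefix="$prefix" && make && make install
fi

# Make the installed executables reachable from the shell.
PATH="$bindir:$PATH"
export PATH
```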

NOTICE: Originally, instead of Source-highlight, there were two separate programs, namely GNU java2html and GNU cpp2html. Two shell scripts with the same names will be installed together with Source-highlight in order to facilitate the migration (however, their use is deprecated).


2.1 Download

You can download it from GNU's ftp site: ftp://ftp.gnu.org/gnu/src-highlite or from one of its mirrors (see http://www.gnu.org/prep/ftp.html).

I do not distribute Windows binaries anymore, since they can be built using the Cygnus C/C++ compiler, available at http://www.cygwin.com. However, if you don't feel like downloading that compiler, or you experience problems with the Boost Regex library (see also Tips on installing Boost Regex library), you can request the binaries directly from me by e-mail (find my e-mail address at my home page) and I'll be happy to send them to you. Please also keep in mind that if you don't have the Boost libraries installed, and your C/C++ compiler distribution does not provide a prebuilt package, it might take some time, even hours, to build them from sources. An MS-Windows port of Source-highlight is available from http://gnuwin32.sourceforge.net; however, I don't maintain those binaries personally, and they might be out of date.

Archives are digitally signed by me (Lorenzo Bettini) with GNU gpg (http://www.gnupg.org). My GPG public key can be found at my home page (http://www.lorenzobettini.it).

You can also get the patches, if they are available for a particular release (see below for patching from a previous version).


2.2 Anonymous CVS Access

This project's CVS repository can be checked out through anonymous (pserver) CVS with the following command:

     cvs -z3 -d:pserver:anonymous@cvs.savannah.gnu.org:/sources/src-highlite co src-highlite

Further instructions can be found at the address:

http://savannah.gnu.org/projects/src-highlite.

Please notice that this way you will get the latest development sources of Source-highlight, which may also be unstable. This solution is the best if you intend to correct/extend this program: you should send me patches against the latest cvs repository sources.

If, on the contrary, you want to get the sources of a given release through cvs, say version X.Y.Z, you must specify the tag rel_X_Y_Z when you run the cvs checkout command or the cvs update command.

NOTICE: This convention has held since release 2.1.
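The tag naming convention can be expressed in a couple of lines of shell. This is a sketch: release_tag is a helper name invented here, and -r is the standard cvs option for selecting a tag on checkout.

```shell
# release_tag (hypothetical helper): turn a version number of the form
# X.Y or X.Y.Z into the corresponding CVS tag, e.g. 2.9 becomes rel_2_9.
release_tag() {
    printf 'rel_%s\n' "$(printf '%s' "$1" | tr '.' '_')"
}

# Usage (not executed here, since it needs network access):
#   cvs -z3 -d:pserver:anonymous@cvs.savannah.gnu.org:/sources/src-highlite \
#       co -r "$(release_tag 2.9)" src-highlite
```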

When you compile the sources that you get through the cvs repository, before running the configure and make commands, you should, at least the first time, run the command:

     sh autogen.sh

This will run the autotools commands in the correct order, and also copy possibly missing files. You should have installed recent versions of automake, autoconf and libtool in order for this to succeed. You will also need flex and bison.


2.3 What you need to build source-highlight

Since version 2.0 Source-highlight relies on regular expressions as provided by boost (http://www.boost.org), so you need to install at least the regex library from boost.

Most GNU/Linux distributions provide this library already in a compiled form. If you use your distribution packages, please be sure to install also the development package of the boost libraries.

If you experience problems in installing Boost Regex library, or in compiling source-highlight because of this library, please take a look at Tips on installing Boost Regex library.

If you want to use a specific version of the Boost regex library (because you have many versions of it), you can use the configure option --with-boost-regex to specify a particular suffix. For instance,

     ./configure --with-boost-regex=boost_regex-gcc-1_31

Source-highlight has been developed under GNU/Linux, using gcc (C++), bison (yacc) and flex (lex), and ported to Win32 with the Cygnus C/C++ compiler, available at http://www.cygwin.com.

I use the excellent GNU Autoconf [2], GNU Automake [3] and GNU Libtool [4]. Since version 2.6 I have also been using Gnulib - The GNU Portability Library [5], “a central location for common GNU code, intended to be shared among GNU packages” (for instance, I rely on Gnulib for checking for the presence and correctness of the getopt_long function).

Finally, I used GNU gengetopt (http://www.gnu.org/software/gengetopt) for command line parsing.

I have also started to use doublecpp (http://doublecpp.sourceforge.net), which makes dynamic overloading possible.

Actually, apart from the boost regex library, you don't need the other tools above to build source-highlight (indeed I provide the output sources generated by the above mentioned tools), unless you want to develop source-highlight.

However, if you obtained sources through CVS, you need some other tools, see Anonymous CVS Access.


2.4 Tips on installing Boost Regex library

I created this section because many users reported problems after installing the Boost Regex library from sources; other users had problems in compiling source-highlight even if this library was already correctly installed (especially Windows users using Cygwin). I hope this section sheds some light on installing and using the Boost Regex library. Please notice that this section does not explain how to compile the Boost libraries (the documentation you'll find on http://www.boost.org is well done); it explains how to tweak things if you have problems in compiling source-highlight even after a successful installation of the Boost libraries.

If you experience no problem in compiling source-highlight, you can happily skip this section :-)

First of all, if your distribution provides packages for the Boost regex library, please be sure to install also the development package of the boost libraries, i.e., those providing also the header files needed to compile a program using these libraries. For instance, on my Debian system I had to install the package libboost-regex-dev, besides the package libboost-regex.

If your distribution does not provide these packages, then you have to download the sources of the Boost libraries from http://www.boost.org and follow the instructions for compilation and installation. However, I suggest you specify /usr as the prefix for installation, instead of relying on the default prefix /usr/local (unless /usr/local/include is already in the inclusion path of your C++ compiler), since this will make things easier when compiling source-highlight: /usr/include is usually the place where the C++ compiler searches for header files during compilation.

If you successfully compiled and installed the Boost Regex library, or you installed the package from your distribution, but you STILL experience problems in compiling source-highlight, then you simply have to adjust some things as described in the following.

If the ./configure command of source-highlight reports this error:

     ERROR! Boost::regex library not installed.

then, the compiler cannot find the header files for this library. In this case, check that the directory /usr/include/boost actually exists; if it does not, then probably you'll find a similar directory, e.g., /usr/include/boost-1_33/boost, depending on the version of the library you have installed. Then, all you have to do is to create a symbolic link as follows:

     ln -s /usr/include/boost-1_33/boost /usr/include/boost

Alternatively, you might run source-highlight's configure as follows:

     ./configure CXXFLAGS=-I/usr/include/boost-1_33/

If the ./configure command of source-highlight then reports this other error:

     ERROR! Boost::regex library is installed, but you
     must specify the suffix with --with-boost-regex at configure
     for instance, --with-boost-regex=boost_regex-gcc-1_31

then, there's still another thing to fix: you must find out the exact names of the files of your installed Boost Regex libraries; you can do this by using the command:

     $ ls -l /usr/lib/libboost_regex*

which, for instance, on one of my Cygwin installations reports:

     -rwxr-x---+ Nov  9 23:29 /usr/lib/libboost_regex-gcc-mt-s-1_33.a
     -rwxr-x---+ Nov 22 09:22 /usr/lib/libboost_regex-gcc-mt-s.a
     -rwxr-x---+ Nov  9 23:29 /usr/lib/libboost_regex-gcc-mt-s-1_33.so
     -rwxr-x---+ Nov 22 09:22 /usr/lib/libboost_regex-gcc-mt-s.so

Now you have all the information needed to correctly run source-highlight's configure command:

     ./configure --with-boost-regex=boost_regex-gcc-mt-s-1_33

or, if you solved the first problem in the second way [6],

     ./configure CXXFLAGS=-I/usr/include/boost-1_33/ \
                 --with-boost-regex=boost_regex-gcc-mt-s-1_33

Of course, you have to modify this command according to the names of your Boost Regex library installed files.
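Judging from the example above, the value passed to --with-boost-regex is the library file name without the leading lib and without the extension. A small sketch of that derivation follows; boost_regex_suffix is a name made up for this example.

```shell
# boost_regex_suffix (hypothetical helper): derive the value for
# --with-boost-regex from the path of an installed library file.
boost_regex_suffix() {
    name=$(basename "$1")
    name=${name#lib}        # drop the leading "lib"
    name=${name%.a}         # drop a static library extension
    name=${name%%.so*}      # drop a shared library extension and any version after it
    printf '%s\n' "$name"
}

# Usage: print a candidate configure option for every installed variant.
for lib in /usr/lib/libboost_regex*; do
    [ -e "$lib" ] || continue
    echo "--with-boost-regex=$(boost_regex_suffix "$lib")"
done
```

Static and shared variants of the same library yield the same value, so duplicate lines in the output are expected.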

These instructions have allowed many users who were experiencing problems to compile source-highlight. If you still have problems, please send me an e-mail.


2.5 Patching from a previous version

If you downloaded a patch, say source-highlight-1.3-1.3.1.patch.gz (i.e., the patch to go from version 1.3 to version 1.3.1), cd to the directory with the sources of the previous version (source-highlight-1.3) and type:

     gunzip -cd ../source-highlight-1.3-1.3.1.patch.gz | patch -p1

and restart the compilation process (if you had already run configure a simple make should do).


2.6 Using source-highlight with less

This was suggested by Konstantine Serebriany. The script src-hilite-lesspipe.sh will be installed together with source-highlight. You can use the following environment variables:

     export LESSOPEN="| /path/to/src-hilite-lesspipe.sh %s"
     export LESS=' -R '

This way, when you use less to browse a file, if it is a source file handled by source-highlight, it will be automatically highlighted.


2.7 Using source-highlight as a CGI

CGI support was enabled thanks to Robert Wetzel; I haven't tested it personally. If you want to use source-highlight as a CGI program, you have to use the executable source-highlight-cgi. You can build this executable by issuing

     make source-highlight-cgi

in the src directory.


2.8 Building .rpm

Christian W. Zuckschwerdt added support for building an .rpm and a .src.rpm. You can issue the following command

     rpm -tb source-highlight-2.9.tar.gz

for building an .rpm with binaries and

     rpm -ts source-highlight-2.9.tar.gz

for building a .src.rpm with sources.


3 Copying Conditions

GNU Source-highlight is free software; you are free to use, share and modify it under the terms of the GNU General Public License that accompanies this software (see COPYING).

GNU Source-highlight was written and is maintained by Lorenzo Bettini http://www.lorenzobettini.it.


4 Simple Usage

Here are some realistic examples of running source-highlight [7].

Source-highlight only does a lexical analysis of the source code, so the program source is assumed to be correct!

Here's how to run source-highlight (for this example we will use C/C++ input files, but this is valid also for other source-highlight input languages):

     source-highlight --src-lang cpp --out-format html \
         --input <C++ file> \
         --output <html file> \
         --style-file <style file> \
         options

For input files, apart from the -i (--input) option and standard input redirection, you can simply specify some files at the command line, also using shell wildcard patterns (for instance *.java). In this case the name of each output file will be formed from the name of the source file with a .<ext> appended, where <ext> is the extension chosen according to the output format specified (in this example it would be .html). The style file (Output format style) contains information on how to format specific language parts (e.g., keywords in blue and boldface, etc.).

If the string STDOUT is passed as the argument of the -o (--output) option, the output is forced to standard output.

If -s (--src-lang) is not specified, the source language is inferred from the extension of the input file (this, of course, does not work with standard input redirection). For further details, see How the input language is discovered.

If -f (--out-format) is not specified, the output will be produced in HTML.

If --style-file is not specified, the default.style, which is included in the distribution, will be used (see Output format style for further information).
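To make the naming rule for output files concrete, here is a sketch that highlights several files at once. output_name is a helper invented for this example; it only restates the rule given above (source name plus .<ext>), and the extension is passed in explicitly because only the HTML case (.html) is spelled out in this section.

```shell
# output_name (hypothetical helper): name of the output file for a source
# file $1 and an output extension $2, following the rule described above.
output_name() {
    printf '%s.%s\n' "$1" "$2"
}

# Highlight every Java file in the current directory as HTML and report
# where each result is expected (only run if the program is installed).
if command -v source-highlight >/dev/null 2>&1; then
    for src in *.java; do
        [ -e "$src" ] || continue
        source-highlight --out-format html "$src" \
            && echo "$src -> $(output_name "$src" html)"
    done
fi
```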


4.1 HTML and XHTML output

The default output format for HTML and XHTML uses fixed width fonts by inserting all the formatted output between <tt> and </tt>. Thus, for instance, the style specifications for fixed width and not fixed width (see Output format style) will have no effect: every character will have fixed width. If you don't like this default behavior and would prefer fonts that are not fixed width by default (as happens, e.g., with LaTeX output), you can use the file html_notfixed.outlang with the command line argument --outlang-def. For XHTML output, the corresponding file is xhtml_notfixed.outlang.

Furthermore, the file htmltable.outlang can be used to generate HTML output enclosed in an HTML table (which will use also a background color if specified in the style file). The file xhtmltable.outlang does the same but for XHTML output.


4.2 LaTeX output

When using the LaTeX output format you can choose between monochromatic output (by using -f latex) and colored output (by using -f latexcolor). When using colored output, you need the color package (which should already be present in your system). Of course, you are free to define your own LaTeX output format, see Output Language Definitions.


4.3 Texinfo output

When using the Texinfo output format, you may want to use a dedicated style file, texinfo.style, which comes with the source-highlight distribution, with the option --style-file. For instance, the example in Examples is formatted with this style file.


4.4 DocBook output

DocBook output is generated using the <programlisting> tag. If the --doc command line option is given, an <article> document is generated.


4.5 ANSI color escape sequences

If you're using this output format, for instance together with less (see Using source-highlight with less), you may want to use the esc.style, which comes with the source-highlight distribution, with the option --style-file. This should result in a more pleasant coloring output.


5 Configuration files

During execution, source-highlight needs some files where it finds directives on how to recognize the source language (if not specified explicitly with --src-lang or --lang-def), on which output format to use (if not specified explicitly with --out-format or --outlang-def), on how to format specific source elements (e.g., keywords, comments, etc.), and source and output language definitions. These files will be explained in the next sections.

If the directory for such files is not explicitly specified with the command line option --data-dir, these files are searched for in the following order:

If you want to be sure about which file is used during the execution, you can use the command line option --verbose.


5.1 Output format style

You must specify your options for syntax highlighting in the file default.style [8]. You can specify formatting options for each element defined by a language definition file (you can get the list of such elements by using --show-lang-elements, see Listing Language Elements).

Since version 2.6, you can also specify the background color for the output document, using the keyword bgcolor (this might be visible only when the --doc command line option is used).

If many elements share the same formatting options, you can specify these elements in the same line, separated by a comma [9].

Here's the default.style that comes with this distribution (this is formatted by using the style.lang that is shown in Tutorials on Language Definitions):

     bgcolor "white"; // the background color for documents
     
     keyword blue b ; // for language keywords
     type darkgreen ; // for basic types
     string red f ; // for strings and chars
     regexp orange f ; // for regular expressions
     specialchar pink f ; // for special chars, e.g., \n, \t, \\
     comment brown i, noref; // for comments
     number purple ;       // for literal numbers
     preproc darkblue b ; // for preproc directives (e.g. #include, import)
     symbol darkred ; // for symbols (e.g. <, >, +)
     function black b; // for function calls and declarations
     cbracket red; // for block brackets (e.g. {, })
     todo bg:cyan b;       // for TODO and FIXME
     
     // for OOP
     classname darkgreen ; // for class names, e.g., in Java and C++
     
     // line numbers
     linenum black f;
     
     // Internet related
     url blue u, f;
     
     // other elements for ChangeLog and Log files
     date blue b ;
     time, file darkblue b ;
     ip, name darkgreen ;
     
     // for Prolog, Perl...
     variable darkgreen ;
     
     // explicit for Latex
     italics darkgreen i;
     bold darkgreen b;
     underline darkgreen u;
     fixed green f;
     argument darkgreen;
     optionalargument purple;
     math orange;
     bibtex blue;
     
     // for diffs
     oldfile orange;
     newfile darkgreen;
     difflines blue;
     
     // for css
     selector purple;
     property blue;
     value darkgreen i;

This file tries to define a style for most elements defined in the language definition files that come with the Source-highlight distribution.

You can specify your own file (it doesn't have to be named default.style) with the command line option --style-file [10], see Invoking source-highlight.

You can also specify the color of normal text by adding this line:

     normal darkblue ;

As you can see, the syntax of this file is quite straightforward: after the element (or elements, separated by commas) you can specify the color, and the background color [11] by using the prefix bg: (for instance, in the default.style above the background color is specified for the todo element).

Notice that the background color might not be available for all output formats: it is available for XHTML and LaTeX but not for HTML [12].

Then, you can specify further formatting options such as bold, italics, etc.; these are the keywords that can be used:

     b = bold
     i = italics
     u = underline
     f = fixed
     nf = not fixed
     noref = no reference information is generated for these elements

Since version 2.2, the color specification is not required. For instance, the texinfo.style is as follows (we avoid colors for Texinfo outputs):

     keyword, type b ;
     variable f, i ;
     string f ;
     regexp f ;
     comment nf, i, noref ;
     preproc b ;
     
     // line numbers
     linenum f;
     
     // Internet related
     url f;
     
     // for diffs
     oldfile, newfile i;
     difflines b;
     
     // for css
     selector, property b;
     value i;

You may also specify more than one of these options, separated by commas, e.g.

     keyword blue u, b ;

Please keep in mind that in this case the order of the specified options is kept during the generation of the output; for instance, depending on the specific output format, the sequences u, b and b, u may lead to different results. In particular, the option that comes first is applied after the ones that follow, so it ends up outermost. For instance, in the case of HTML, the sequence u, b will lead to the following formatting: <u><b>...</b></u>.

The noref option specifies that reference information is not generated for this element (see Generating References). For instance, this is used for the comment element, since we do not want elements in a comment to be searched for cross-references.

These are all the possible logical color names handled by source-highlight [13]:

     black
     red
     darkred
     brown
     yellow
     cyan
     blue
     pink
     purple
     orange
     brightorange
     green
     brightgreen
     darkgreen
     teal
     gray
     darkblue

You can also use the direct color scheme for the specific output format, by using double quotes, such as, e.g., "#00FF00" in HTML [14], or even string colors in double quotes [15], such as "lightblue". Of course, the double quotes will be discarded during the generation.

For instance, this is the syslog.style used in the tests directory. This uses direct color schemes.

     date, keyword yellow b ;
     time "#9999FF" ;
     ip "lightblue" b ;
     
     type cyan b ;
     string "brown" b ;
     comment teal ;
     number red ;
     preproc cyan ;
     symbol green ;
     function "#CC66CC" b ;
     cbracket green b ;
     twonumbers green b ;
     port green b ;
     webmethod teal ;
     
     // foo option
     foo red b ; // foo entry
     
     

Notice that, if you use direct color schemes, source-highlight will perform no transformation, and will output exactly the color scheme you specified. For instance, the specification "brown" is different from brown: the former will be output as it is, while the latter will be translated into the corresponding color of the output format (for HTML the visible result is likely to be the same).

It is up to you to specify a color scheme string that is handled by the specific output format. Thus, direct color schemes might not be portable in different output formats; for instance, "#00FF00" is valid in HTML but not in LaTeX.
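The difference between a logical color name and a direct color scheme can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the function resolve_color and the translation table (including its color values) are hypothetical, since the real translation depends on the output format.

```python
# Hypothetical translation table for one output format; the values are
# made up for the example.
EXAMPLE_COLORS = {"brown": "#993300", "green": "#33CC00"}


def resolve_color(spec, table):
    """Return the color string to emit for a color given in a style file."""
    if len(spec) >= 2 and spec[0] == '"' and spec[-1] == '"':
        # Direct color scheme: the quotes are discarded, nothing is translated.
        return spec[1:-1]
    # Logical name: translated into the color of the output format.
    return table[spec]


print(resolve_color("brown", EXAMPLE_COLORS))      # translated
print(resolve_color('"brown"', EXAMPLE_COLORS))    # brown
print(resolve_color('"#00FF00"', EXAMPLE_COLORS))  # #00FF00
```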



5.2 Output format style using CSS

Since version 2.6 you can also specify the output format style using a limited CSS syntax. Please notice that this has nothing to do with the output produced by source-highlight using the --css option.

By using a CSS file as the style file (i.e., passing it to the --style-css-file command line option) you only specify the output format style, using the same syntax as CSS. This means that you can use CSS syntax for specifying the output format style independently of the actual output format (this is what the output format style is for). Thus, you can use a CSS file as the output format style also for LaTeX output (just like you would do with a source-highlight output format style, see Output format style).

This feature is provided basically for code re-use: you can specify the output format style using a css file, and then re-use the same css file as the actual style sheet of other HTML pages (or even output files produced by source-highlight using the --css option).

Notice that this feature is quite rudimentary, so only a limited subset of CSS syntax is recognized. In particular, selectors are always intended as CSS class selectors, so they must start with a dot. /* */ comments are handled. Properties (and their values) not handled by source-highlight are simply (and silently) discarded.

This is an example of CSS specification handled correctly by source-highlight as a style format specification:

     body {
       background-color: <color specification>;
      }
     
     .selector {
       color: <color specification>;
       background-color: <color specification>;
       font-weight: bold; /* this is a comment */
       font-family: monospace;
       font-style: italic;
       text-decoration: underline;
      }
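To make the rules above concrete, here is a small Python sketch of a parser for such a limited CSS subset. It is not the parser used by source-highlight: the function name parse_style_css, the set of handled properties (taken from the example above) and the special treatment of body are assumptions made for illustration.

```python
import re

# Properties appearing in the example above; anything else is dropped silently.
HANDLED = {"color", "background-color", "font-weight",
           "font-family", "font-style", "text-decoration"}


def parse_style_css(text):
    """Parse a limited CSS subset into {element: {property: value}}."""
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.S)  # strip /* */ comments
    styles = {}
    for selectors, body in re.findall(r"([^{}]+)\{([^{}]*)\}", text):
        props = {}
        for decl in body.split(";"):
            name, sep, value = decl.partition(":")
            if sep and name.strip() in HANDLED:
                props[name.strip()] = value.strip()
        for sel in selectors.split(","):
            sel = sel.strip()
            if sel.startswith("."):
                # A class selector names a language element.
                styles.setdefault(sel[1:], {}).update(props)
            elif sel == "body":
                # Assumption: body only carries the document background.
                styles.setdefault("body", {}).update(props)
    return styles


css = """
body { background-color: white; }
.keyword { color: blue; font-weight: bold; }  /* a comment */
.type, .classname { color: darkgreen; margin: 0; }
"""
print(parse_style_css(css))
```

In this sketch the margin property is discarded without any warning, and the two selectors .type and .classname receive the same style.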

Finally, this is the default.css that corresponds to default.style presented in Output format style:

     body {  background-color: white;  }
     
     .keyword { color: blue; font-weight: bold; }
     .type, .classname { color: darkgreen; }
     .string { color: red; font-family: monospace; }
     .regexp { color: orange; }
     .specialchar { color: pink; font-family: monospace; }
     .comment { color: brown; font-style: italic; }
     .number { color: purple; }
     .preproc { color: darkblue; font-weight: bold; }
     .symbol { color: darkred; }
     .function { color: black; font-weight: bold; }
     .cbracket { color: red; }
     .todo { font-weight: bold; background-color: cyan; }
     
     /* line numbers */
     .linenum { color: black; font-family: monospace; }
     
     /* Internet related */
     .url { color: blue; text-decoration: underline; font-family: monospace; }
     
     /* other elements for ChangeLog and Log files */
     .date { color: blue; font-weight: bold; }
     .time, .file { color: darkblue; font-weight: bold; }
     .ip, .name { color: darkgreen; }
     
     /* for Prolog, Perl */
     .variable { color: darkgreen; }
     .italics { color: darkgreen; font-style: italic; }
     .bold { color: darkgreen; font-weight: bold; }
     
     /* for LaTeX */
     .underline { color: darkgreen; text-decoration: underline; }
     .fixed { color: green; font-family: monospace; }
     .argument, .optionalargument { color: darkgreen; }
     .math { color: orange; }
     .bibtex { color: blue; }
     
     /* for diffs */
     .oldfile { color: orange; }
     .newfile { color: darkgreen; }
     .difflines { color: blue; }
     
     /* for css */
     .selector { color: purple; }
     .property { color: blue; }
     .value { color: darkgreen; font-style: italic; }

If you pass this file to the --style-css-file command line option and you produce an output file, you will get the same result as using default.style.

Source-highlight comes with a lot of CSS files that can be used either as standard CSS files for HTML documents, or as style files to pass to --style-css-file. In the documentation installation directory (see Installation) you will find the file style_examples.html which shows many output examples, each one with a different CSS style.



5.3 Default Styles

This file16 (the default file is style.defaults) lists the default style for a language element whose output style is not specified in the style file; in particular the following line (comment lines start with #):

     elem1 = elem2

tells that, if the style for an element, say elem1, is not specified in the style file, then elem1 will have the same style as elem2.

For instance, this is the style.defaults that comes with Source-highlight:

     # defaults for styles
     # the format is:
     # elem1 = elem2
     # meaning that if the style for elem1 is not specified,
     # then it will have the same style as elem2
     
     classname = normal
     preproc = keyword
     section = function
     paren = cbracket

In this case the style for the element preproc will default to the style of the element keyword.

This file is useful when you want to create your own style file and you don't want to specify styles for all the elements that will have the same output style in your style (e.g., the default style formats preproc elements differently from keywords, but if in your style you don't specify a style for it, a preproc element will still be formatted as a keyword).
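The fallback described in this section can be sketched in Python as follows. The functions parse_defaults and lookup_style are hypothetical names; in particular, following a chain of defaults (elem1 = elem2, elem2 = elem3) is an assumption of this sketch and is not stated by the manual.

```python
def parse_defaults(text):
    """Parse 'elem1 = elem2' lines; lines starting with # are comments."""
    defaults = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        elem, _, fallback = line.partition("=")
        defaults[elem.strip()] = fallback.strip()
    return defaults


def lookup_style(elem, styles, defaults):
    """Return the style of elem, falling back on the defaults if missing."""
    seen = set()
    while elem not in styles and elem in defaults and elem not in seen:
        seen.add(elem)  # guards against circular defaults
        elem = defaults[elem]
    return styles.get(elem)


defaults = parse_defaults("""
# defaults for styles
preproc = keyword
section = function
""")
# A user style file that only specifies two elements.
styles = {"keyword": "blue b", "function": "black b"}
print(lookup_style("preproc", styles, defaults))  # blue b
print(lookup_style("section", styles, defaults))  # black b
```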



5.4 Language map

This configuration file associates a file extension to a specific language definition file. You can also use such file extension to specify the --src-lang option (see Simple Usage). Source-highlight comes with such a file, called lang.map.

Of course, you can override the settings of this file by writing your own language map file and specifying it with the command line option --lang-map. Moreover, as explained above, if a file lang.map is present in the current directory, that version will be used. The format of such a file is quite simple (comment lines start with #):

     extension = language definition file

The default language definition file is shown in Introduction.



5.5 Language definition files

These files are crucial for source-highlight since they specify the source elements that have to be highlighted. These files also allow you to specify your own language definitions in order to deal with a language that is not handled by source-highlight17. The syntax for these files is explained in Language Definitions.



5.6 Output Language map

This configuration file associates an output format to a specific output language definition file. You can use the name of that output format to specify the --out-format option (see Simple Usage). Source-highlight comes with such a file, called outlang.map.

Of course, you can override the settings of this file by writing your own output language map file and specifying it with the command line option --outlang-map. Moreover, as explained above, if a file outlang.map is present in the current directory, that version will be used. The format of such a file is quite simple:

     output format name = language definition file

The default language definition file is shown in Introduction.

In particular, there is a convention for the output format names in the output language map, based on a suffix appended to the name with a dash -:

-doc
The one used when --doc command line option is given
-css-doc
The one used when --css command line option is given
-css
The one used when --css and --no-doc command line options are given

If a combination of the above mentioned command line options is given for a specific output format, and a corresponding definition file is not specified in the map file, then an error is raised.

For instance, if you specified the definition file for your language mylang and also one for dealing with --doc option, i.e., a definition file for mylang-doc, and you run source-highlight as follows:

     source-highlight -f mylang --css mycss.css

you will get the following error:

     source-highlight: output language mylang-css-doc not handled
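The naming convention and the error above can be sketched in Python. This is only an illustration of the rules listed in this section: the functions outlang_name and select_outlang and the .outlang file names are hypothetical, and other options that imply --doc (such as --title) are left out for brevity.

```python
def outlang_name(fmt, doc=False, css=False, no_doc=False):
    """Return the output language map entry to look up for the given options."""
    if css:
        # --css implies --doc, unless --no-doc cancels it.
        return f"{fmt}-css" if no_doc else f"{fmt}-css-doc"
    if doc and not no_doc:
        return f"{fmt}-doc"
    return fmt


def select_outlang(fmt, outlang_map, **options):
    """Return the definition file for fmt, or fail if the entry is missing."""
    name = outlang_name(fmt, **options)
    if name not in outlang_map:
        raise LookupError(f"output language {name} not handled")
    return outlang_map[name]


# Hypothetical map with entries for mylang and mylang-doc only.
outlang_map = {"mylang": "mylang.outlang", "mylang-doc": "mylang_doc.outlang"}
print(select_outlang("mylang", outlang_map, doc=True))  # mylang_doc.outlang
try:
    select_outlang("mylang", outlang_map, css=True)
except LookupError as err:
    print(err)  # output language mylang-css-doc not handled
```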



5.7 Output Language definition files

These files are crucial for source-highlight since they specify how the source elements are highlighted. These files also allow you to specify your own output format definitions in order to deal with an output format that is not handled by source-highlight18. The syntax for these files is explained in Output Language Definitions.



5.8 Developing your own definition files

I encourage those who write new language definitions or correct/modify existing language definitions to send them to me so that they can be added to the source-highlight distribution!

Since these files require more explanations (that, however, are not necessary to the standard usage of source-highlight), they are carefully explained in separate parts: Language Definitions and Output Language Definitions.



6 Invoking source-highlight

The format for running the source-highlight program is:

     source-highlight option ...

source-highlight supports the following options, shown by the output of source-highlight --help:

     source-highlight
     
     Highlight the syntax of a source file (e.g. Java) into a specific format (e.g.
     HTML)
     
     Usage: source-highlight [OPTIONS]...
     
       -h, --help                    Print help and exit
       -V, --version                 Print version and exit
       -i, --input=filename          input file. default std input
       -o, --output=filename         output file. default std output. If STDOUT is
                                       specified, the output is directed to standard
                                       output
       -s, --src-lang=STRING         source language (use --lang-list to get the
                                       complete list).  If not specified, the source
                                       language will be guessed from the file
                                       extension.
           --lang-list               list all the supported language and associated
                                       language definition file
           --outlang-list            list all the supported output language and
                                       associated language definition file
       -f, --out-format=STRING       output format (use --outlang-list to get the
                                       complete list)  (default=`html')
       -d, --doc                     create an output file that can be used as a
                                       stand alone document (e.g., not to be
                                       included in another one)
           --no-doc                  cancel the --doc option even if it is implied
                                       (e.g., when css is given)
       -c, --css=filename            the external style sheet filename.  Implies
                                       --doc
       -T, --title=STRING            give a title to the output document.  Implies
                                       --doc
       -t, --tab=INT                 specify tab length.  (default=`8')
       -H, --header=filename         file to insert as header
       -F, --footer=filename         file to insert as footer
           --style-file=filename     specify the file containing format options
                                       (default=`default.style')
           --style-css-file=filename specify the file containing format options (in
                                       css syntax)
           --style-defaults=filename specify the file containing defaults for format
                                       options  (default=`style.defaults')
           --outlang-def=filename    output language definition file
           --outlang-map=filename    output language map file
                                       (default=`outlang.map')
           --data-dir=path           directory where language definition files and
                                       language maps are searched for.  If not
                                       specified these files are searched for in the
                                       current directory and in the data dir
                                       installation directory
           --output-dir=path         output directory
           --lang-def=filename       language definition file
           --lang-map=filename       language map file  (default=`lang.map')
           --show-lang-elements=filename
                                     prints the language elements that are defined
                                       in the language definition file
           --infer-lang              force to infer source script language
                                       (overriding given language specification)
     
     reference generation:
       -n, --line-number[=padding]   number all output lines, using the specified
                                       padding character  (default=`0')
           --line-number-ref[=prefix]
                                     number all output lines and generate an anchor,
                                       made of the specified prefix + the line
                                       number  (default=`line')
           --gen-references=STRING   generate references  (possible
                                       values="inline", "postline", "postdoc"
                                       default=`inline')
           --ctags-file=filename     specify the file generated by ctags that will
                                       be used to generate references
                                       (default=`tags')
           --ctags=cmd               how to run the ctags command.  If this option
                                       is not specified, ctags will be executed with
                                       the default value.  If it is specified with
                                       an empty string, ctags will not be executed
                                       at all  (default=`ctags --excmd=n
                                       --tag-relative=yes')
     
     testing:
       -v, --verbose                 verbose mode on
       -q, --quiet                   print no progress information
           --statistics              print some statistics (i.e., elapsed time)
           --gen-version             put source-highlight version in the generated
                                       file  (default=on)
           --check-lang=filename     only check the correctness of a language
                                       definition file
           --check-outlang=filename  only check the correctness of an output
                                       language definition file
           --failsafe                if no language definition is found for the
                                       input, it is simply copied to the output
       -g, --debug-langdef[=type]    debug a language definition.  In dump mode just
                                       dumps all the steps; in interactive, at each
                                       step, waits for some input (press ENTER to
                                       step)  (possible values="interactive",
                                       "dump" default=`dump')
           --show-regex=filename     show the regular expression automaton
                                       corresponding to a language definition file

Let us explain some options in detail (apart from those that should be clear from the --help output itself, and those already explained in Simple Usage).

--doc
-d
If you want a stand alone output document (i.e., an output file that is not meant to be included in another document), specify this option (otherwise you just get some text that you can paste into another document). If you choose this option and do not provide a --title, then your source file name will be used as the title.
--no-doc
The --doc option above is actually implied by other command line options (e.g., --css). If you do not want this (e.g., you want to include the output in an existing document containing the global style sheet), you can disable this by using --no-doc.
--css
-c
Specify the style sheet file (e.g., a .css for HTML19) for the output document. Notice that source-highlight will not use this file: it will simply use this file name when generating the output file, in order to specify that the output file uses this file as its style sheet (e.g., the generated HTML refers to this file as its CSS file).
--tab
-t
With this option, tab characters will be converted into the specified number of space characters (tabulation points will be preserved). This option is automatically selected when generating line numbers.
--style-file
--style-css-file
Specify the file that source-highlight will use to produce (i.e., format) the output (e.g., colors and other styles for each language element). The formats of these files are detailed in Output format style and in Output format style using CSS, respectively.
--style-defaults
Specify the file that contains the default styles for elements whose styles are not found in the style file (see Default Styles for further details).
--output-dir
You can pass to source-highlight more than one input file (see Simple Usage). In this case you cannot specify the output file name. In such cases the output files will be automatically generated into the directory where you invoked the command from; if you want the output files to be generated into a different directory you can use this option.
--infer-lang
Force the inference mechanism for detecting the input language. This is detailed in How the input language is discovered.
--line-number
Line numbers will be generated in the output, using the (optional) specified padding character20 (the default padding character is 0).
--line-number-ref
As --line-number, this option numbers all the output lines, and, additionally, generates an anchor for each line. The anchor consists of the specified prefix (default is line) and the line number (e.g., line25). For instance, as prefix, if you deal with many files, you can use the file name. Notice that some output languages might not support this feature (e.g., esc, since it makes no sense in such case). See Anchors and References for defining how to generate an anchor in a specific output language.
--failsafe
If no language specification is found, an error will be printed and the program exits. With this option, instead, in such situations, the input is simply formatted in the output format. This is useful when source-highlight is used with many input files, and it is also used in the src-hilite-lesspipe.sh script. Actually I failed to find a good reason why one should not always use this option. So my suggestion is to always use it when you run source-highlight (and indeed, in the future, this option might become the default one). See also Using source-highlight with less, Using source-highlight as a simple formatter.

When using --failsafe, if no input language can be established, source-highlight will use the input language definition file default.lang, which is an empty file. You might want to customize such file, though.

--debug-langdef
--show-regex
Allow you to debug a language definition file (see Debugging).

The other command line options dealing with references are explained in more detail in Generating References.
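As a side note on the --tab option described above, "preserving tabulation points" means that each tab is replaced by as many spaces as are needed to reach the next tab stop, not by a fixed number of spaces. The standard Python method str.expandtabs behaves this way, so it can be used to sketch the idea; the function name convert_tabs is an assumption and this is not the code used by source-highlight.

```python
def convert_tabs(line, tab=8):
    """Replace tabs with spaces so that columns stay aligned on tab stops."""
    return line.expandtabs(tab)


# With tab=8, "a" is followed by 7 spaces and "bc" by 6 spaces,
# so that "bc" starts at column 8 and "d" at column 16.
print(repr(convert_tabs("a\tbc\td")))
print(repr(convert_tabs("a\tbc\td", tab=4)))
```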



6.1 How the input language is discovered

As already explained in Simple Usage, source-highlight uses a language definition file according to the language specified with the option --src-lang, or --lang-def, or by using the input file extension.

Since version 2.5, source-highlight can use an inference mechanism to deduce the input language. For the moment, it can detect script languages based on the “sha-bang” mechanism, i.e., when the first line of a script contains a line such as, e.g.,

     #!/bin/sh

It also detects script languages specified by using the env program21:

     #!/usr/bin/env perl

Finally, it also recognizes the Emacs convention, of declaring the Emacs major mode using the format -*- lang -*-.

For instance, a script starting as the following one:

     #!/bin/bash
     # -*- Tcl -*-

will be interpreted as a Tcl script, and not as a bash script.

This inference mechanism is performed, by default, in case the input language is neither explicitly specified nor found in the language map file by using the input file extension (the input file may also have no extension at all).

Furthermore, this mechanism can be given priority with the command line option --infer-lang. For instance, this is used in the script src-hilite-lesspipe.sh (Using source-highlight with less) when running source-highlight, in order to avoid the problem of formatting a Perl script as a Prolog program (since the extension .pl is associated to Prolog programs in the language map file).
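The inference rules of this section can be sketched in Python as follows. This is a simplified illustration, not the actual mechanism: the function name infer_language, the regular expressions, and the choice of returning a lowercase name are assumptions, and the mapping from the inferred name to a language definition file is left out.

```python
import re


def infer_language(first_lines):
    """Guess a script language from its first lines, or return None."""
    lang = None
    for line in first_lines:
        # Emacs convention: -*- lang -*- (it wins over the sha-bang line).
        emacs = re.search(r"-\*-\s*(\S+?)\s*-\*-", line)
        if emacs:
            return emacs.group(1).lower()
        # Sha-bang line: #!/path/to/interpreter, possibly through env.
        shebang = re.match(r"#!\s*(\S+)(?:\s+(\S+))?", line)
        if shebang and lang is None:
            prog = shebang.group(1).rsplit("/", 1)[-1]
            if prog == "env" and shebang.group(2):
                prog = shebang.group(2)
            lang = prog
    return lang


print(infer_language(["#!/bin/sh"]))                     # sh
print(infer_language(["#!/usr/bin/env perl"]))           # perl
print(infer_language(["#!/bin/bash", "# -*- Tcl -*-"]))  # tcl
```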



7 Language Definitions

Since version 2.0 source-highlight uses a specific syntax to specify source language elements (e.g., keywords, strings, comments, etc.). Before version 2.0, language elements were scanned through Flex. This had the drawback of writing a new flex file to deal with a new language; even worse, a new language could not be added “dynamically”: you had to recompile the whole source-highlight program.

Instead, now, language elements are specified in a file, loaded dynamically, through a (hopefully) simple syntax. Then, these definitions are used internally to create, on-the-fly, regular expressions that are used to highlight the elements. In particular, we use the regular expressions provided by the Boost library (see Installation). Thus, when writing a language definition file you will surely have to deal with regular expressions. Of course, we use the Boost regex library regular expression syntax. We refer to the Boost documentation for such syntax, http://www.boost.org/libs/regex/doc/syntax.html; however, in Notes on regular expressions, we provide some notes on regular expressions that might be helpful for those who have never dealt with them. By default, the Boost regex library uses Perl regular expression syntax, and, at the moment, this is the only syntax supported by source-highlight.

Here, we see such syntax in detail, by relying on many examples. This allows a user to easily modify an existing language definition and create a new one. These files typically have the extension .lang.

Each definition basically associates a regular expression to a language element and defines a name for the language element. Such a name will be used to associate a particular style (e.g., bold face, color, etc.) when highlighting such elements. You cannot use names that are the same as keywords used in the language definition syntax (e.g., start, as shown later, is a reserved word).

Comments can be given by using #; the rest of the line is considered as a comment.

Source-highlight will scan each line of the input file separately. So a regular expression that tries to match new line characters is destined to fail. However, the language definition syntax provides means to deal with multiple lines (see Delimited definitions and State/Environment Definitions).



7.1 Ways of specifying regular expressions

Before getting into the details of the language definition syntax, it is crucial to describe the 3 possible ways of specifying a regular expression string. These 3 ways basically differ in how they handle regular expression special characters, such as, e.g., parentheses. For this reason, one mechanism can be more powerful than another one, but it could also require more attention; furthermore, there can be situations where you're forced to use only one mechanism, since the other ones cannot accomplish the required goal.

"expression"
If you use double quotes (notice, " and not `` or '') to specify a regular expression, then basically all the characters, but the alternation symbol, i.e., the pipe symbol |, are considered literally, and thus will be automatically escaped (e.g., a dot . is interpreted as the character . not as the regular expression wild card). Thus, for instance, if you specify
          "my(regular)ex.pre$$ion{*}"
     

source-highlight will automatically transform it into

          my\(regular\)ex\.pre\$\$ion\{\*\}
     

The special character |, unless it is meant to separate two alternatives (Simple definitions), must be escaped with the character \, e.g., \|. Also the character \, if it is intended literally, must be escaped, e.g., \\.


'expression'
If you want to enjoy the full power of regular expressions, you need to use single quoted strings ('), instead of double quoted strings. This way, you can specify special characters with their intended meaning.

However, marked subexpressions are automatically transformed in non marked subexpressions, i.e., the parts in the expression of the shape (...) will be transformed into (?:...) (as explained in Notes on regular expressions, (?:...) lexically groups part of a regular expression, without generating a marked sub-expression).

Thus, for instance, if you specify

          'my(regular)ex.pre$ion*'
     

source-highlight will automatically transform it into

          my(?:regular)ex.pre$ion*
     

Since marked subexpressions cannot be specified with this syntax, then backreferences (see Notes on regular expressions) are not allowed.


`expression`
This syntax22 (notice the difference, this one uses the backtick ` while the previous one uses ') for specifying a regular expression was introduced to overcome the limitations of the other two syntaxes. With this syntax, the marked subexpressions are not transformed, and so you can use regular expressions mechanisms that rely on marked subexpressions, such as backreferences and conditionals (see Notes on regular expressions).

This syntax is also crucial for highlighting specific program parts of some programming languages, such as, e.g., Perl regular expressions (e.g., in substitution expressions) that can be expressed in many forms, in particular, separators for the part to be replaced and the part to replace with can be any non alphanumerical characters23, for instance,

          s/foo/bar/g
          s|foo|bar|g
          s#foo#bar#g
          s@foo@bar@g
     

Using this syntax, and backreferences, we can easily define a single language element to deal with these expressions (without specifying all the cases for each possible non alphanumerical character):

          regexp = `s([^[:alnum:][:blank:]]).*\1.*\1[ixsmogce]*`
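The transformations described for the three syntaxes can be reproduced with a short Python sketch. It only illustrates the rules stated above and is not the code used by source-highlight: the function names and the set of special characters are assumptions, the single quote transformation ignores corner cases (such as parentheses inside character classes), and, since Python's re module has no POSIX classes like [:alnum:], the backtick example is rewritten with explicit character ranges.

```python
import re

# Characters escaped in this sketch (an assumption; | is left alone).
SPECIAL = set(".[]{}()*+?^$")


def from_double_quoted(expr):
    """Escape every special character except the alternation symbol |."""
    out, i = [], 0
    while i < len(expr):
        c = expr[i]
        if c == "\\" and i + 1 < len(expr):
            out.append(expr[i:i + 2])  # \| and \\ are kept as they are
            i += 2
            continue
        out.append("\\" + c if c in SPECIAL else c)
        i += 1
    return "".join(out)


def from_single_quoted(expr):
    """Turn marked subexpressions (...) into non marked ones (?:...)."""
    # An opening parenthesis neither escaped nor already followed by ?
    return re.sub(r"(?<!\\)\((?!\?)", "(?:", expr)


print(from_double_quoted("my(regular)ex.pre$$ion{*}"))
print(from_single_quoted("my(regular)ex.pre$ion*"))

# Backtick syntax: the marked subexpression is kept, so \1 can refer to it.
perl_subst = re.compile(r"s([^A-Za-z0-9 \t]).*\1.*\1[ixsmogce]*")
for s in ["s/foo/bar/g", "s|foo|bar|g", "s#foo#bar#g", "s@foo@bar@g"]:
    print(s, bool(perl_subst.fullmatch(s)))
```

The first two calls print the same transformed expressions shown above for the double quote and single quote syntaxes, and each of the four substitution expressions is matched by the single backreference-based pattern.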
     



7.2 Simple definitions

The simplest way of specifying language elements is to list the possible alternatives. This is the case, for instance, for keywords. For instance, in java.lang you have:

     keyword = "abstract|assert|break|case|catch|class|const",
               "continue|default|do|else|extends|false|final",
               "finally|for|goto|if|implements|instanceof|interface"
     keyword = "native|new|null|private|protected|public|return",
               "static|strictfp|super|switch|synchronized|throw",
               "throws|true|this|transient|try|volatile|while"

You can separate quoted definitions with commas. Alternatively, within a quoted definition, alternatives can be separated with the pipe symbol |. The above definition defines the language element keyword. Each time an element is found in the source file, it is highlighted with the style for the element with the same name in the output format style file (notice that all elements shown in the example are taken from the language definition files that come with source-highlight and there is a style for each of such elements, see Configuration files). If such an element is not specified in the output format style file, it is simply not highlighted (actually, it is highlighted with style normal, see Configuration files), so pay attention to typos :-).

From the above example you may have noticed that language element definitions are cumulative, so the second keyword definition does not replace the first one. (Indeed, in some cases you may want to actually redefine a language element; this is possible as explained in Redefinitions and Substitutions).

Notice that words specified in double quotes have to match exactly in a source file, and they must be isolated (not surrounded by anything but spaces). Thus for instance class is matched as a keyword, but in my_class the substring class is not matched as keyword. From the point of view of regular expressions a string such as class in a double quote simple definition is intended as \<(class)\>.

Special characters have to be escaped with the character \. So for instance if you want to specify the character |, which is normally used to separate alternatives in double quoted strings, you have to specify \|.

As explained in Ways of specifying regular expressions, definitions in double quotes are interpreted literally (thus, e.g., a dot . is interpreted as the character . not as the regular expression wild card). If you want to enjoy the full power of regular expressions to specify a language alternative, you have to use single quoted strings ('), instead of double quoted strings.

For instance, the following is the definition for a preprocessor directive in C/C++:

     preproc = '^[[:blank:]]*#([[:blank:]]*[[:word:]]*)'

Notice that the definition 'class' is different from "class", as explained above. Thus, for instance 'class' matches also the sub-expression class inside my_class.

Finally, at the end of a list of definitions, one can specify the keyword nonsensitive; in that case, the specified strings will be interpreted in a case-insensitive way. For instance, we use this feature in the Pascal language definition, pascal.lang, where keywords are parsed in a case-insensitive way:

     keyword = "alfa|and|array|begin|case|const|div",
           "do|downto|else|end|false|file|for|function|get|goto|if|in",
           "label|mod|new|not|of|or|pack|packed|page|program",
           "put|procedure|read|readln|record|repeat|reset|rewrite|set",
           "text|then|to|true|type|unpack|until|var|while|with|writeln|write"
       nonsensitive
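The matching rules of this section (isolated words for double quoted definitions, plain substrings for single quoted ones, and the nonsensitive keyword) can be tried out with Python's re module. This is a sketch under stated assumptions: the function double_quoted_keywords is hypothetical, and Python's \b word boundary is used as an approximation of the \< and \> anchors mentioned above.

```python
import re


def double_quoted_keywords(alternatives, nonsensitive=False):
    """Compile "a|b|c" so that each word matches only when isolated."""
    flags = re.IGNORECASE if nonsensitive else 0
    # Each alternative is taken literally (escaped \| is not handled here).
    words = [re.escape(w) for w in alternatives.split("|")]
    return re.compile(r"\b(?:" + "|".join(words) + r")\b", flags)


java_kw = double_quoted_keywords("class|const")
print(bool(java_kw.search("public class Foo")))  # True
print(bool(java_kw.search("my_class")))          # False
# The single quoted 'class' also matches inside my_class.
print(bool(re.search("class", "my_class")))      # True

pascal_kw = double_quoted_keywords("begin|end", nonsensitive=True)
print(bool(pascal_kw.search("BEGIN")))           # True
```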



7.3 Line wide definitions

It is often useful to define a language element that affects all the remaining characters up to the end of the line. For such definitions, instead of the = you must use the keyword start. For instance, the following is the definition of a single line comment in C++:

     comment start "//"

This says that when the two characters // are encountered in the source file, everything from these characters (inclusive) up to the end of the line will be highlighted according to the style comment.
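
The same form applies to other single line comment markers. As a sketch, assuming a language whose comments begin with the character #:

     comment start "#"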



7.4 Order of definitions

The order of language definitions is important, since it is the order used during regular expression matching. You then have to make sure that, if there are definitions that start with the same characters, the longest expression is specified first in the file. For instance, if you write

     symbol = "/"
     comment start "//"

The first expression will always be matched first, and the second expression will never be matched. The right order is

     comment start "//"
     symbol = "/"



7.5 Delimited definitions

Many elements are delimited by specific character sequences, for instance strings and multiline comments. The syntax for such an element definition is

     <name> delim <left delimiter> <right delimiter> \
             {escape <escape character>} \
             {multiline} {nested}

The escape specification lets you specify the escape character that may precede one of the delimiters inside the element. This is optional.

For instance, this is the definition of C-like strings:

     string delim "\"" "\"" escape "\\"

Notice that \ is a special character in definitions, so it has to be escaped. If the escape specification were omitted, the C string "write \"hello\" string" would be highlighted incorrectly (it would be highlighted as the string "write \", the normal character sequence hello\ and the string " string").
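
As a further sketch, assuming a language whose character literals are enclosed in single quotes and use the same escape character, the definition would follow the same pattern (here the single quote needs no escaping inside the double-quoted delimiter):

     string delim "'" "'" escape "\\"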

The option multiline specifies that the element can span multiple lines. For instance, PHP strings are defined as follows:

     string delim "\"" "\"" escape "\\" multiline

The option nested tells source-highlight to count nested occurrences of the delimiters and to match each opening delimiter with its corresponding closing one (using a stack). For instance, if we wanted to highlight C-like multiline comments in a nested way (24), we could use the following definition:

     comment delim "/*" "*/" multiline nested

If nested were not used, then the first */ in the following nested comment would conclude the comment (and the second */ would not be highlighted as a comment):

     /*
        This is a /* nested comment */
     */

As said above, definitions are cumulative, even when they use different syntactic forms. Thus, for instance, the complete definition of C++-style comments is the following (actually, the definition of C-style comments is more involved, see the file c_comment.lang):

     comment start "//"
     comment delim "/*" "*/" multiline



7.6 Variable definitions

It is possible to define variables to be re-used in many parts in a language definition file. A variable is defined by using

     vardef <name of the variable> = <list of definitions>

Once defined, a variable can be used by prepending the symbol $ to its name. For instance,

     vardef FUNCTION = '(?:[[:alpha:]]|_)[[:word:]]*[[:blank:]]*(?=\()'
     function = $FUNCTION

The capital letters are used only for readability.

It is also possible to concatenate variables and expressions, and reuse variables inside further variable definitions:

     vardef basic_time = '[[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}'
     vardef time = '\<' + $basic_time + '\>'
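
A variable built this way is then used like any other. In the following sketch the element name time is an assumption chosen only for illustration:

     time = $time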



7.7 Dynamic Backreferences

With dynamic backreferences you can refer to a string matched by the regular expression of the first element of a delim specification (25). I called these backreferences dynamic in order to distinguish them from the backreferences of regular expression syntax (Ways of specifying regular expressions). This is crucial in cases where the right delimiter depends on a subexpression matched by the left delimiter; for instance, Lua comments can be of the shape --[[ comment ]] or --[=[ comment ]=], but neither --[=[ comment ]] nor --[[ comment ]=] (furthermore, they can be nested) (26). Thus, the regular expression of the right element depends on the one of the left element.

A dynamic backreference is similar to a variable (Variable definitions), but it requires no declaration, and it has the form

     @{number}

where number is the number of the marked subexpression in the left delimiter (source-highlight will actually check that such a marked subexpression exists in the left delimiter).

For instance, this is the definition of Lua comments (see also lua.lang):

     environment comment delim `--\[(=*)\[` "]" + @{1} + "]"
                 multiline nested begin
       include "url.lang"
       ...
     end

Notice how the left delimiter can match an optional sequence of = characters, as a marked subexpression, and the right delimiter refers to it with @{1}.

Source-highlight will take care of escaping possible special characters during dynamic backreference substitutions. For instance, suppose that you must substitute | for @{1}, because we matched | with the subexpression [^[:alnum:]] in a delim element like the following one:

     comment delim `([^[:alnum:]])` @{1}

Since | is a special character in regular expression syntax, source-highlight will actually replace @{1} with \|.

IMPORTANT: the right delimiter can only refer to subexpressions of its left delimiter; thus, in case of nested delim element definitions (e.g., in states or environments, State/Environment Definitions), the left delimiter acts as a binder and hides possible subexpressions defined in outer delim elements.



7.8 File inclusion

It is possible to include other language definition files into another file. The inclusion physically inserts the contents of the included file into the current file during parsing, at the exact point of inclusion (just like #include in C/C++). This is useful for re-using definitions in many files. For instance, C++ comment definitions are given in the file c_comment.lang, and this file is included in the Java and C++ definition files. The same happens for numbers and functions. For instance, the file java.lang contains the following include instructions:

     include "c_comment.lang"
     
     include "number.lang"
     
     keywords ...
     
     include "function.lang"

Notice that the order of inclusion is crucial since the order of definition is crucial. If the function definitions were included before the keyword definitions, then the expression if (exp) would be highlighted as a function invocation.



7.9 State/Environment Definitions

Sometimes you want some source elements to be highlighted only if they are surrounded by other elements. Source-highlight language definitions also provide this feature, with the following syntax:

     state|environment <standard definition> begin
       <other definitions>
     end

This structure is recursive (so other state/environment definitions can be given within a state/environment). The meaning of a state/environment is that the definitions within the begin ... end are matched only if the definitions that define the state/environment have been matched. When entering a state/environment, however, the definitions given outside the state/environment are not matched. The difference between state and environment is that in the latter, normal parts of the source language (i.e., those that do not match any definition) are highlighted according to the style of the definition that defines the environment.

As an example, the following defines the multiline nested C comment, and highlights URL and e-mail addresses only when they appear inside a comment (notice that this uses file inclusion):

     environment comment delim "/*" "*/" multiline nested begin
           include "url.lang"
     end

Notice that we used environment because everything else inside a comment has to be formatted according to the comment style.
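
For comparison, here is a sketch of the same definition written with state instead of environment. Following the description above, the text inside the comment that does not match a definition from url.lang would then presumably not be formatted in the comment style:

     state comment delim "/*" "*/" multiline nested begin
           include "url.lang"
     end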

While for programming language definitions states/environments can be avoided (although they make it possible to highlight some parts only inside a specific environment, e.g., URLs inside comments, or documentation tags in Javadoc comments), they are pretty important for highlighting files such as logs and ChangeLog files, since elements have to be highlighted when they appear in a specific position. For instance, for ChangeLog (see changelog.lang), we use a state for highlighting the date, name, e-mail or URL (taken from url.lang):

     state date start '[[:digit:]]{2,4}-?[[:digit:]]{2}-?[[:digit:]]{2}' begin
       include "url.lang"
       name = '([[:word:]]|[[:punct:]])+'
     end

Notice that definitions that appear inside a state/environment have the same scope as the expressions that define the environment. While this makes sense for start and delim definitions, it may make less sense for simple definitions (i.e., those that simply list all possible expressions): in fact, in this case, such expressions do not define a scope. For such definitions, the semantics of state/environment is that the state/environment starts after matching one of the alternatives. And where will it end? In this case you must explicitly exit the environment. For instance, you can say that a specific language definition, when encountered inside a state/environment, also exits the environment (with the keyword exit). You can even exit all the environments with exitall. For instance, the following definition highlights a non-empty string following a web method:

     vardef non_empty = '[^[:blank:]]+'
     
     state webmethod = "OPTIONS|GET|HEAD|POST|PUT|DELETE",
               "TRACE|CONNECT|PROPFIND|MKCOL|COPY|MOVE|LOCK|UNLOCK" begin
       string = $non_empty exit
     end
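
The following hypothetical sketch shows exitall, assuming it is written in the same position as exit. The outer state, its element name keyword and its trigger word request are made up for illustration; non_empty is the variable defined above. After the non-empty string is matched, both the inner and the outer state would be left:

     state keyword = "request" begin
       state webmethod = "GET|POST" begin
         string = $non_empty exitall
       end
     end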

If you ever need such advanced features, you may want to take a look at the log.lang definition file that defines highlighting for several log files (access logs, Apache logs, etc.).



7.10 Explicit subexpressions with names

Often, you need to specify two program elements in the same regular expression, because they are tightly related, but you also need to highlight them differently.

For instance, you might want to highlight the name of a class (or interface) in a class (or interface) definition (e.g., in Java). Thus, you can rely on the preceding class keyword which will then be followed by an identifier.

A definition such as

     keyword = '(\<(?:class|interface))([[:blank:]]+)([$[:alnum:]]+)'

will not produce a good final result, since the name of the class will be highlighted as a keyword, which is not what you might have wanted: for instance, the class name should be highlighted as a type.

Up to version 2.6, the only way to do this was to use state or environments (State/Environment Definitions) but this tended to be quite difficult to write.

Since version 2.7, you can specify a regular expression with marked subexpressions and bind each of them to a specific language element (the regular expression must be enclosed in `, see Ways of specifying regular expressions):

     (elem1,...,elemn) = `(subexp1)(...)(subexpn)`

Now, with this syntax, we can accomplish our previous goal:

     (keyword,normal,type) =
       `(\<(?:class|interface))([[:blank:]]+)([$[:alnum:]]+)`

This way, the class (or interface) will be highlighted as a keyword, the separating blank characters are formatted as normal, and the name of the class as a type.

Notice that the number of element names must be equal to the number of subexpressions in the expression; furthermore, at least in the current version, the expression can contain only marked subexpressions (no character outside is allowed) and no nested subexpressions are allowed.

Thus, the following specifications are NOT correct:

     (keyword,symbol) = `(...)(...)(...)` # number of elements doesn't match
     (keyword,symbol) = `(...(...)...)(...)` # contains nested subexpressions
     (keyword,symbol) = `...(...)...(...)` # outside characters
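
By contrast, the following forms would satisfy these constraints. This is only a sketch that reuses the ... placeholders above; the non-marking group (?:...) is the same construct used in the class/interface example, and the extra element names are chosen for illustration:

     (keyword,symbol,normal) = `(...)(...)(...)` # one name per subexpression
     (keyword,symbol) = `(...(?:...)...)(...)` # inner group is non-marking
     (normal,keyword,normal,symbol) = `(...)(...)(...)(...)` # no outside characters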

This mechanism permits expressing regular expressions for some situations in a much more compact and probably more readable way. For instance, for highlighting ChangeLog parts (the optional * as a symbol, the optional file name and the element specified in parentheses as a file element, and the rest as normal) such as

       * src/Makefile.am (source_highlight_SOURCES): correctly include
       changelog_scanner.ll
     
       * this is a comment without a file name

up to version 2.6, we used these two language definitions:

     state symbol start '^(?:[[:blank:]]+)\*[[:blank:]]+' begin
       state file start '[^:]+\:' begin
         normal start '.'
       end
     end
     
     state normal start '^(?:[[:blank:]]+)' begin
       state file start '[^:]+\:' begin
         normal start '.'
       end
     end

which are hard to read once written. Now, we can write them more easily (see changelog.lang):

     (normal,symbol,normal,file)=
       `(^[[:blank:]]+)(\*)([[:blank:]]+)((?:[^:]+\:)?)`
     (normal,file)= `(^[[:blank:]]+)((?:[^:]+\:)?)`



7.11 Redefinitions and Substitutions

These two features are useful when you want to define a language by re-using an existing language definition with some changes. Typically you include another language definition file and you redefine/substitute some elements.

When you use redef you replace all the previous definitions of that language element with the new one. The new language element definition will be placed exactly at the point of the new definition. We use this feature, for instance, when we define the sml language by re-using the caml one: they differ only in the keywords (27). In fact, the contents of sml.lang can be summarized as follows:

     include "caml.lang"
     
     redef keyword = "abstraction|abstype|and|andalso..."
     
     redef type = "int|byte|boolean|char|long|float|double|short|void"

Since the new language element definition appears in the exact point of the redefinition, this means that such a regular expression will be matched only if all the previous ones (the ones of the included file) cannot be matched. This may lead to unwanted results in some cases (not in the sml case though). In other words the following code

     keyword = "foo"
     keyword = "bar"
     type = "int"
     redef keyword = "myfoo"

is equivalent to the following one

     type = "int"
     keyword = "myfoo"

If this is not what you want, you can use subst, which is similar to redef except that it replaces the first previous definition of that language element at the exact point of that first definition (all other definitions of that element are simply erased). That is to say, the following code

     keyword = "foo"
     keyword = "bar"
     type = "int"
     subst keyword = "myfoo"

is equivalent to the following one

     keyword = "myfoo"
     type = "int"

It is up to you to decide which one best fits your needs. We use this feature to define javascript in terms of java:

     include "java.lang"
     
     subst ke