Extract Program (The GNU Awk User’s Guide)

11.3.7 Extracting Programs from Texinfo Source Files

Both this chapter and the previous chapter (A Library of awk Functions) present a large number of awk programs. If you want to experiment with these programs, it is tedious to type them in by hand. Here we present a program that can extract parts of a Texinfo input file into separate files.

This Web page is written in Texinfo, the GNU Project’s document formatting language. A single Texinfo source file can be used to produce both printed documentation, with TeX, and online documentation. (Texinfo is fully documented in the book Texinfo—The GNU Documentation Format, available from the Free Software Foundation, and also available online.)

For our purposes, it is enough to know three things about Texinfo input files:

The “at” symbol (‘@’) is special in Texinfo, much as the backslash (‘\’) is in C or awk. Literal ‘@’ symbols are represented in Texinfo source files as ‘@@’.
Comments start with either ‘@c’ or ‘@comment’. The file-extraction program works by using special comments that start at the beginning of a line.
Lines containing ‘@group’ and ‘@end group’ commands bracket example text that should not be split across a page boundary. (Unfortunately, TeX isn’t always smart enough to do things exactly right, so we have to give it some help.)

The following program, extract.awk, reads through a Texinfo source file and does two things, based on the special comments. Upon seeing ‘@c system …’, it runs a command, by extracting the command text from the control line and passing it on to the system() function (see Input/Output Functions). Upon seeing ‘@c file filename’, each subsequent line is sent to the file filename, until ‘@c endfile’ is encountered. The rules in extract.awk match either ‘@c’ or ‘@comment’ by letting the ‘omment’ part be optional. Lines containing ‘@group’ and ‘@end group’ are simply removed. extract.awk uses the join() library function (see Merging an Array into a String).

The example programs in the online Texinfo source for GAWK: Effective AWK Programming (gawktexi.in) have all been bracketed inside ‘file’ and ‘endfile’ lines. The gawk distribution uses a copy of extract.awk to extract the sample programs and install many of them in a standard directory where gawk can find them. The Texinfo file looks something like this:

...
This program has a @code{BEGIN} rule
that prints a nice message:

@example
@c file examples/messages.awk
BEGIN @{ print "Don't panic!" @}
@c endfile
@end example

It also prints some final advice:

@example
@c file examples/messages.awk
END @{ print "Always avoid bored archaeologists!" @}
@c endfile
@end example
...

extract.awk begins by setting IGNORECASE to one, so that mixed upper- and lowercase letters in the directives won’t matter.
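
IGNORECASE is specific to gawk. As a minimal sketch of the same case-insensitive matching in other awk implementations, the line can be folded to lowercase with tolower() before it is tested. The helper is_directive() is a hypothetical name used only for this illustration; it is not part of extract.awk, and it assumes that the directive name is passed in lowercase.

```awk
# is_directive() is a hypothetical helper, not part of extract.awk.
# It returns 1 if LINE is a '@c NAME' or '@comment NAME' control line,
# ignoring case, and 0 otherwise.  NAME must be given in lowercase.
function is_directive(line, name)
{
    return tolower(line) ~ ("^@c(omment)?[ \t]+" name)
}

BEGIN {
    print is_directive("@c file examples/messages.awk", "file")  # prints 1
    print is_directive("@COMMENT EndFile", "endfile")            # prints 1
    print is_directive("@code{BEGIN} rule", "file")              # prints 0
}
```

Building the regexp from a string keeps the ‘@c’/‘@comment’ prefix in one place, at the cost of converting every line that is tested.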

The first rule handles calling system(), checking that a command is given (NF is at least three) and also checking that the command exits with a zero exit status, signifying OK:

# extract.awk --- extract files and run programs from Texinfo files

BEGIN    { IGNORECASE = 1 }

/^@c(omment)?[ \t]+system/ {
    if (NF < 3) {
        e = ("extract: " FILENAME ":" FNR)
        e = (e  ": badly formed `system' line")
        print e > "/dev/stderr"
        next
    }
    $1 = ""
    $2 = ""
    stat = system($0)
    if (stat != 0) {
        e = ("extract: " FILENAME ":" FNR)
        e = (e ": warning: system returned " stat)
        print e > "/dev/stderr"
    }
}

The variable e is used so that the rule fits nicely on the screen.
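
To see what command text the rule actually hands to system(), the following sketch applies the same field assignments to a hard-coded control line. The sample line and the variable names nf_before, cmd, and stat are assumptions made for this illustration only.

```awk
BEGIN {
    # A made-up control line, standing in for one read from a Texinfo file.
    $0 = "@c system echo hello   world"
    nf_before = NF              # 5 fields, so the NF < 3 check passes

    # Emptying the first two fields makes awk rebuild $0 from the
    # fields, joined with OFS (a single space by default).
    $1 = ""
    $2 = ""
    cmd = $0
    printf "[%s]\n", cmd        # prints [  echo hello world]

    # A command that fails yields a nonzero status, which the rule
    # reports as a warning instead of stopping.
    stat = system("exit 3")
    if (stat != 0)
        print "warning: system returned " stat
}
```

Note that rebuilding the record also collapses runs of whitespace inside the command into single spaces, which matters only for commands that depend on exact spacing.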

The second rule handles moving data into files. It verifies that a file name is given in the directive. If the file named is not the current file, then the name of the current file is saved in filelist so that it can be closed later, and the new name becomes the current file. Keeping every file open until the end of the input allows the use of the ‘>’ redirection for printing the contents, keeping open-file management simple.

The for loop does the work. It reads lines using getline (see Explicit Input with getline). For an unexpected end-of-file, it calls the unexpected_eof() function. If the line is an “endfile” line, then it breaks out of the loop. If the line is an ‘@group’ or ‘@end group’ line, then it ignores it and goes on to the next line. Similarly, comments within examples are also ignored.

Most of the work is in the following few lines. If the line has no ‘@’ symbols, the program can print it directly. Otherwise, each leading ‘@’ must be stripped off. To remove the ‘@’ symbols, the line is split into separate elements of the array a, using the split() function (see String-Manipulation Functions). The ‘@’ symbol is used as the separator character. Each element of a that is empty indicates two successive ‘@’ symbols in the original line. For each such ‘@@’ in the original file, we have to add a single ‘@’ symbol back in: the loop replaces the empty element with ‘@’ and, if the element after it is also empty, skips over that one.

When the processing of the array is finished, join() is called with the value of SUBSEP (see Multidimensional Arrays) as the separator, which join() treats as a request to use no separator at all, to rejoin the pieces back into a single line. That line is then printed to the output file:

/^@c(omment)?[ \t]+file/ {
    if (NF != 3) {
        e = ("extract: " FILENAME ":" FNR ": badly formed `file' line")
        print e > "/dev/stderr"
        next
    }
    if ($3 != curfile) {
        if (curfile != "")
            filelist[curfile] = 1   # save to close later
        curfile = $3
    }

    for (;;) {
        if ((getline line) <= 0)
            unexpected_eof()
        if (line ~ /^@c(omment)?[ \t]+endfile/)
            break
        else if (line ~ /^@(end[ \t]+)?group/)
            continue
        else if (line ~ /^@c(omment)?[ \t]+/)
            continue
        if (index(line, "@") == 0) {
            print line > curfile
            continue
        }
        n = split(line, a, "@")
        # if a[1] == "", means leading @,
        # don't add one back in.
        for (i = 2; i <= n; i++) {
            if (a[i] == "") { # was an @@
                a[i] = "@"
                if (a[i+1] == "")
                    i++
            }
        }

        print join(a, 1, n, SUBSEP) > curfile
    }
}
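
The ‘@’ handling in the loop above can be tried in isolation. The following is a minimal sketch, not the program's own code: unescape_at() is a hypothetical function name, and plain string concatenation stands in for the call to join() so that the example needs no library file.

```awk
# unescape_at() is a hypothetical helper for experimenting with the
# '@' handling; extract.awk itself does this inline and calls join().
function unescape_at(line,    a, n, i, out)
{
    if (index(line, "@") == 0)
        return line                 # nothing to strip

    n = split(line, a, "@")
    # An empty a[1] only means the line started with '@'.
    for (i = 2; i <= n; i++) {
        if (a[i] == "") {           # was an @@
            a[i] = "@"
            if (i < n && a[i+1] == "")
                i++                 # skip the empty element that follows
        }
    }

    out = ""
    for (i = 1; i <= n; i++)        # rejoin with no separator
        out = out a[i]
    return out
}

BEGIN {
    print unescape_at("BEGIN @{ print \"hi\" @}")   # BEGIN { print "hi" }
    print unescape_at("user@@example.com")          # user@example.com
    print unescape_at("no markup here")             # no markup here
}
```

The only intended difference from the loop in the rule is the ‘i < n’ guard, which avoids referencing an element past the end of the array; the resulting text is the same.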

An important thing to note is the use of the ‘>’ redirection. Output done with ‘>’ only opens the file once; it stays open and subsequent output is appended to the file (see Redirecting Output of print and printf). This makes it easy to mix program text and explanatory prose for the same sample source file (as has been done here!) without any hassle. No file is closed until the END rule runs at the end of the input.

When a new file name is encountered, instead of closing the file, the program saves the name of the current file in filelist. This makes it possible to interleave the code for more than one file in the Texinfo input file. (Previous versions of this program did close the file. But because of the ‘>’ redirection, a file whose parts were not all one after the other ended up getting clobbered.) An END rule then closes all the open files when processing is finished:

END {
    close(curfile)          # close the last one
    for (f in filelist)     # close all the rest
        close(f)
}
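
The effect of closing a file too early can be seen in a small experiment. This is a sketch under stated assumptions: the scratch file name in tmp and the helper count_lines() are made up for the illustration, and the sketch removes the scratch file when it is done.

```awk
# count_lines() is a hypothetical helper that reports how many
# lines FILE currently holds.
function count_lines(file,    line, n)
{
    n = 0
    while ((getline line < file) > 0)
        n++
    close(file)
    return n
}

BEGIN {
    tmp = "/tmp/extract-demo.txt"   # hypothetical scratch file

    print "part one" > tmp          # the first '>' creates or truncates
    print "part two" > tmp          # same open file: output is appended
    close(tmp)
    kept_open = count_lines(tmp)    # both parts survive

    print "part three" > tmp        # reopening with '>' truncates again
    close(tmp)
    reopened = count_lines(tmp)     # only the last part is left

    print kept_open, reopened       # prints 2 1
    system("rm -f " tmp)
}
```

The second count shows the clobbering described above: once a file has been closed, writing to it again with ‘>’ discards what was there, which is why the program defers every close() to the END rule.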

Finally, the function unexpected_eof() prints an appropriate error message and then exits:

function unexpected_eof()
{
    printf("extract: %s:%d: unexpected EOF or error\n",
                     FILENAME, FNR) > "/dev/stderr"
    exit 1
}
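
The test ‘(getline line) <= 0’ in the second rule catches two different conditions, which is why the message says “EOF or error.” The following sketch shows the two return values separately. It uses the ‘getline var < file’ form only so that it runs without any input; the two path names are illustrative stand-ins, and the first one is assumed not to exist.

```awk
BEGIN {
    # Both paths are stand-ins for this illustration; extract.awk
    # itself reads from its main input with a plain 'getline line'.
    missing = "/nonexistent/extract-demo.texi"
    empty = "/dev/null"

    err = (getline line < missing)  # -1: the file cannot be opened
    eof = (getline line < empty)    #  0: end of file right away
    close(empty)

    print err, eof                  # prints -1 0
    if (err <= 0 && eof <= 0)
        print "both cases satisfy the <= 0 test"
}
```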