

16.4.5.4 Customized Input Parsers

By default, gawk reads text files as its input. It uses the value of RS to find the end of the record, and then uses FS (or FIELDWIDTHS or FPAT) to split it into fields (see Reading Files). Additionally, it sets the value of RT (see Built-in Variables).

If you want, you can provide your own custom input parser. An input parser's job is to return a record to the gawk record processing code, along with indicators for the value and length of the data to be used for RT, if any.

To provide an input parser, you must first provide two functions (where XXX is a prefix name for your extension):

awk_bool_t XXX_can_take_file(const awk_input_buf_t *iobuf)
This function examines the information available in iobuf (which we discuss shortly). Based on the information there, it decides if the input parser should be used for this file. If so, it should return true. Otherwise, it should return false. It should not change any state (variable values, etc.) within gawk.
awk_bool_t XXX_take_control_of(awk_input_buf_t *iobuf)
When gawk decides to hand control of the file over to the input parser, it calls this function. This function in turn must fill in certain fields in the awk_input_buf_t structure, and ensure that certain conditions are true. It should then return true. If an error of some kind occurs, it should not fill in any fields, and should return false; then gawk will not use the input parser. The details are presented shortly.

Your extension should package these functions inside an awk_input_parser_t, which looks like this:

     typedef struct awk_input_parser {
         const char *name;   /* name of parser */
         awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
         awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
         awk_const struct awk_input_parser *awk_const next;   /* for gawk */
     } awk_input_parser_t;

The fields are:

const char *name;
The name of the input parser. This is a regular C string.
awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
A pointer to your XXX_can_take_file() function.
awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
A pointer to your XXX_take_control_of() function.
awk_const struct awk_input_parser *awk_const next;
This pointer is used by gawk. The extension cannot modify it.

The steps are as follows:

  1. Create a static awk_input_parser_t variable and initialize it appropriately.
  2. When your extension is loaded, register your input parser with gawk using the register_input_parser() API function (described below).
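As a rough illustration of step 1, the following sketch declares and initializes a static parser variable. So that it compiles on its own, it uses stand-in declarations copied from the structure shown above; a real extension would get them from gawkapi.h. The demo_ prefix, the parser name "demo", and the placeholder function bodies are assumptions made for the example.

```c
#include <stddef.h>

/* Stand-in declarations mirroring the text above, so that this sketch
   compiles alone. A real extension would include gawkapi.h instead. */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
#define awk_const const
typedef struct awk_input awk_input_buf_t;   /* left opaque in this sketch */

typedef struct awk_input_parser {
    const char *name;   /* name of parser */
    awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
    awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
    awk_const struct awk_input_parser *awk_const next;   /* for gawk */
} awk_input_parser_t;

/* Hypothetical parser functions. The bodies are placeholders that
   never claim a file; a real parser would inspect iobuf. */
static awk_bool_t demo_can_take_file(const awk_input_buf_t *iobuf)
{
    (void) iobuf;
    return awk_false;
}

static awk_bool_t demo_take_control_of(awk_input_buf_t *iobuf)
{
    (void) iobuf;
    return awk_false;
}

/* Step 1: a static awk_input_parser_t, initialized in field order.
   The next field belongs to gawk, so it starts out as NULL. */
static awk_input_parser_t demo_parser = {
    "demo",
    demo_can_take_file,
    demo_take_control_of,
    NULL
};
```

For step 2, the extension's initialization code would then pass &demo_parser to register_input_parser().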

An awk_input_buf_t looks like this:

     typedef struct awk_input {
         const char *name;       /* filename */
         int fd;                 /* file descriptor */
     #define INVALID_HANDLE (-1)
         void *opaque;           /* private data for input parsers */
         int (*get_record)(char **out, struct awk_input *iobuf,
                           int *errcode, char **rt_start, size_t *rt_len);
         ssize_t (*read_func)();
         void (*close_func)(struct awk_input *iobuf);
         struct stat sbuf;       /* stat buf */
     } awk_input_buf_t;

The fields can be divided into two categories: those for use (initially, at least) by XXX_can_take_file(), and those for use by XXX_take_control_of(). The first group of fields and their uses are as follows:

const char *name;
The name of the file.
int fd;
A file descriptor for the file. If gawk was able to open the file, then fd will not be equal to INVALID_HANDLE. Otherwise, it will.
struct stat sbuf;
If the file descriptor is valid, then gawk will have filled in this structure via a call to the fstat() system call.

The XXX_can_take_file() function should examine these fields and decide if the input parser should be used for the file. The decision can be made based upon gawk state (the value of a variable defined previously by the extension and set by awk code), the name of the file, whether or not the file descriptor is valid, the information in the struct stat, or any combination of the above.

Once XXX_can_take_file() has returned true, and gawk has decided to use your input parser, it calls XXX_take_control_of(). That function then fills in either the get_record field or the read_func field in the awk_input_buf_t. It must also ensure that fd is not set to INVALID_HANDLE. The fields that may be filled in by XXX_take_control_of() are as follows:

void *opaque;
This is used to hold any state information needed by the input parser for this file. It is “opaque” to gawk. The input parser is not required to use this pointer.
int (*get_record)(char **out,
struct awk_input *iobuf,
int *errcode,
char **rt_start,
size_t *rt_len);
This function pointer should point to a function that creates the input records. Said function is the core of the input parser. Its behavior is described below.
ssize_t (*read_func)();
This function pointer should point to a function that has the same behavior as the standard POSIX read() system call. It is an alternative to the get_record pointer. Its behavior is also described below.
void (*close_func)(struct awk_input *iobuf);
This function pointer should point to a function that does the “tear down.” It should release any resources allocated by XXX_take_control_of(). It may also close the file. If it does so, it should set the fd field to INVALID_HANDLE.

If fd is still not INVALID_HANDLE after the call to this function, gawk calls the regular close() system call.

Having a “tear down” function is optional. If your input parser does not need it, do not set this field. Then, gawk calls the regular close() system call on the file descriptor, so the descriptor should still be valid.
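The following sketch shows one way XXX_take_control_of() and a matching “tear down” function might fit together. It allocates per-file state, stores it in opaque, and fills in get_record and close_func; on failure it fills in nothing and returns false. The demo_ names, the demo_state_t type, and the placeholder record function are assumptions for the example, and the type declarations are stand-ins for gawkapi.h.

```c
#include <stddef.h>
#include <stdio.h>      /* EOF */
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Stand-in declarations mirroring the text above (not gawkapi.h itself). */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
#define INVALID_HANDLE (-1)
typedef struct awk_input {
    const char *name;
    int fd;
    void *opaque;
    int (*get_record)(char **out, struct awk_input *iobuf,
                      int *errcode, char **rt_start, size_t *rt_len);
    ssize_t (*read_func)(int, void *, size_t);   /* assumed prototype */
    void (*close_func)(struct awk_input *iobuf);
    struct stat sbuf;
} awk_input_buf_t;

/* Hypothetical per-file state kept behind the opaque pointer. */
typedef struct demo_state {
    size_t records_returned;
} demo_state_t;

/* Placeholder record function: always reports end-of-file. */
static int demo_get_record(char **out, awk_input_buf_t *iobuf, int *errcode,
                           char **rt_start, size_t *rt_len)
{
    (void) out; (void) iobuf; (void) errcode;
    (void) rt_start; (void) rt_len;
    return EOF;
}

/* Tear down: release what demo_take_control_of() allocated. The fd is
   left open and unchanged, so per the text gawk calls close() on it. */
static void demo_close(awk_input_buf_t *iobuf)
{
    if (iobuf == NULL)
        return;
    free(iobuf->opaque);
    iobuf->opaque = NULL;
}

/* Take over the file. On any failure, leave iobuf untouched. */
static awk_bool_t demo_take_control_of(awk_input_buf_t *iobuf)
{
    demo_state_t *state;

    if (iobuf == NULL || iobuf->fd == INVALID_HANDLE)
        return awk_false;

    state = calloc(1, sizeof(*state));
    if (state == NULL)
        return awk_false;

    iobuf->opaque = state;
    iobuf->get_record = demo_get_record;
    iobuf->close_func = demo_close;
    return awk_true;
}
```

Allocating before touching any field keeps the failure path simple: there is nothing to undo if calloc() fails.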

The XXX_get_record() function does the work of creating input records. The parameters are as follows:

char **out
This is a pointer to a char * variable which is set to point to the record. gawk makes its own copy of the data, so the extension must manage this storage.
struct awk_input *iobuf
This is the awk_input_buf_t for the file. The fields should be used for reading data (fd) and for managing private state (opaque), if any.
int *errcode
If an error occurs, *errcode should be set to an appropriate code from <errno.h>.
char **rt_start
size_t *rt_len
If the concept of a “record terminator” makes sense, then *rt_start should be set to point to the data to be used for RT, and *rt_len should be set to the length of the data. Otherwise, *rt_len should be set to zero. gawk makes its own copy of this data, so the extension must manage the storage.

The return value is the length of the buffer pointed to by *out, or EOF if end-of-file was reached or an error occurred.

It is guaranteed that errcode is a valid pointer, so there is no need to test for a NULL value. gawk sets *errcode to zero, so there is no need to set it unless an error occurs.

If an error does occur, the function should return EOF and set *errcode to a non-zero value. In that case, if *errcode does not equal -1, gawk automatically updates the ERRNO variable based on the value of *errcode. (In general, setting ‘*errcode = errno’ should do the right thing.)
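To make the calling convention concrete, here is a minimal sketch of a record function that returns fixed-length records of DEMO_RECLEN bytes read from iobuf->fd. The fixed-length format, the demo_ names, and the buffer kept in opaque are assumptions chosen for illustration; the type declarations are stand-ins for gawkapi.h. Such a format has no record terminator, so *rt_len is set to zero.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>      /* EOF */
#include <unistd.h>     /* read() */
#include <sys/types.h>
#include <sys/stat.h>

/* Stand-in declarations mirroring the text above (not gawkapi.h itself). */
typedef struct awk_input {
    const char *name;
    int fd;
    void *opaque;
    int (*get_record)(char **out, struct awk_input *iobuf,
                      int *errcode, char **rt_start, size_t *rt_len);
    ssize_t (*read_func)(int, void *, size_t);   /* assumed prototype */
    void (*close_func)(struct awk_input *iobuf);
    struct stat sbuf;
} awk_input_buf_t;

#define DEMO_RECLEN 4   /* hypothetical fixed record length */

/* Hypothetical per-file state: storage for the current record. The
   extension owns this storage; gawk copies the data it is handed. */
typedef struct demo_state {
    char buf[DEMO_RECLEN];
} demo_state_t;

static int demo_get_record(char **out, awk_input_buf_t *iobuf, int *errcode,
                           char **rt_start, size_t *rt_len)
{
    demo_state_t *state;
    size_t have = 0;

    if (out == NULL || iobuf == NULL || iobuf->opaque == NULL)
        return EOF;
    state = iobuf->opaque;

    /* Keep reading until a full record arrives or input runs out. */
    while (have < DEMO_RECLEN) {
        ssize_t n = read(iobuf->fd, state->buf + have, DEMO_RECLEN - have);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            *errcode = errno;   /* errcode is guaranteed non-NULL */
            return EOF;
        }
        if (n == 0)
            break;              /* end of file */
        have += (size_t) n;
    }
    if (have == 0)
        return EOF;             /* nothing left; *errcode stays zero */

    *out = state->buf;
    if (rt_start != NULL)
        *rt_start = NULL;       /* no record terminator in this format */
    if (rt_len != NULL)
        *rt_len = 0;
    return (int) have;          /* a short final record is still returned */
}
```

Returning a short final record rather than discarding it is a design choice of this sketch; a stricter parser might treat it as an error instead.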

As an alternative to supplying a function that returns an input record, you may instead supply a function that simply reads bytes, and let gawk parse the data into records. If you do so, the data should be returned in the multibyte encoding of the current locale. Such a function should follow the same behavior as the read() system call, and you fill in the read_func pointer with its address in the awk_input_buf_t structure.

By default, gawk sets the read_func pointer to point to the read() system call. So your extension need not set this field explicitly.

NOTE: You must choose one method or the other: either a function that returns a record, or one that returns raw data. In particular, if you supply a function to get a record, gawk will call it, and never call the raw read function.
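For the raw-data alternative, a sketch along the following lines may help. It wraps read(), retries when interrupted, and replaces NUL bytes with spaces before gawk splits the data into records. The NUL-to-space filtering is only an invented example of a byte-level transformation, and the demo_ names are hypothetical. The text shows read_func declared with an empty parameter list; the stand-in declaration here assumes the prototyped form matching read() so that the sketch compiles under newer C standards.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>     /* read() */
#include <sys/types.h>
#include <sys/stat.h>

/* Stand-in declarations mirroring the text above (not gawkapi.h itself). */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
#define INVALID_HANDLE (-1)
typedef struct awk_input {
    const char *name;
    int fd;
    void *opaque;
    int (*get_record)(char **out, struct awk_input *iobuf,
                      int *errcode, char **rt_start, size_t *rt_len);
    ssize_t (*read_func)(int, void *, size_t);   /* assumed prototype */
    void (*close_func)(struct awk_input *iobuf);
    struct stat sbuf;
} awk_input_buf_t;

/* Behaves like read(), except that NUL bytes become spaces. The byte
   count is unchanged, so 0 still means end-of-file and -1 an error. */
static ssize_t demo_read(int fd, void *buf, size_t count)
{
    char *p = buf;
    ssize_t n;
    ssize_t i;

    do {
        n = read(fd, buf, count);
    } while (n < 0 && errno == EINTR);

    for (i = 0; i < n; i++) {
        if (p[i] == '\0')
            p[i] = ' ';
    }
    return n;
}

/* Install the raw reader and leave get_record alone, so that gawk
   itself parses the returned bytes into records. */
static awk_bool_t demo_take_control_of(awk_input_buf_t *iobuf)
{
    if (iobuf == NULL || iobuf->fd == INVALID_HANDLE)
        return awk_false;
    iobuf->read_func = demo_read;
    return awk_true;
}
```

Because get_record is left unset here, RS, FS, and RT keep their usual meaning for the filtered data.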

gawk ships with a sample extension that reads directories, returning records for each entry in the directory (see Extension Sample Readdir). You may wish to use that code as a guide for writing your own input parser.

When writing an input parser, you should think about (and document) how it is expected to interact with awk code. You may want it to always be called, and take effect as appropriate (as the readdir extension does). Or you may want it to take effect based upon the value of an awk variable, as the XML extension from the gawkextlib project does (see gawkextlib). In the latter case, code in a BEGINFILE section can look at FILENAME and ERRNO to decide whether or not to activate an input parser (see BEGINFILE/ENDFILE).

You register your input parser with the following function:

void register_input_parser(awk_input_parser_t *input_parser);
Register the input parser pointed to by input_parser with gawk.