
4. Extending combine

If combine was built with Guile (GNU’s Ubiquitous Intelligent Language for Extensibility), you can do almost anything you want to extend combine. Whether Guile support is present was determined when combine was compiled and installed on your computer. In a number of places, there are built-in opportunities to call Guile with the data that is currently in process. Using these options, you can use your favorite modules or write your own functions in scheme to manipulate the data and to adjust how combine operates on it.

The most common method (in my current usage) of extending combine is to alter the values of fields from the input files before they are used for matching or for output. This is done inside the field list by adding the scheme statement after the range and precision. This is covered in the section on field specifications. See section Field-specific extensions, for details.

Another useful option is the ability to initialize Guile with your own program. To do this, use the ‘--extension-init-file’ (or ‘-X’) option followed by a file name. combine will load that scheme file into Guile before any processing, so your functions will be available when you need them while the program runs. It certainly beats writing something complicated on the command line.
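
As an illustration, an initialization file might look like the sketch below. The file name ‘my-init.scm’ and the functions normalize-key and same-key? are hypothetical and are not part of the combine distribution. The sketch also assumes that field values reach Guile as strings.

```scheme
;; my-init.scm: hypothetical initialization file, loaded with -X my-init.scm
(use-modules (srfi srfi-13))   ; string-trim-both, string-downcase

;; Strip surrounding whitespace and fold case so that keys that differ
;; only in padding or capitalization compare as equal.
(define (normalize-key str)
  (string-downcase (string-trim-both str)))

;; Return #t when two field values are the same after normalization.
(define (same-key? a b)
  (string=? (normalize-key a) (normalize-key b)))

;; Possible use from the command line (assumes the fields are strings):
;;   combine -X my-init.scm -x 'm(same-key? reference-field-1 data-field-1)' ...
```

Keeping helpers like these in a file means the command line only has to carry a short call.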

In addition, there are Guile modules included in the distribution, which can be used in extension scripts.



4.1 Extension Options

The remaining extensibility options are called at various points in the program: when it starts, when a file is started, when a match is found, when a record is read, when a record is written, when a file is closed, and at the very end of the program. The options are listed below along with the way to get access to the relevant data.

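The various non-field-specific options are as follows. They all occur as arguments to the option ‘--extension’ (or ‘-x’). Each argument begins with a single letter that selects the extension point, followed immediately by the scheme command.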

lscheme-command

Filter records from the current file using the scheme command provided. The scheme command must return ‘#t’ (to keep processing the record) or ‘#f’ (to ignore this record and move on to the next). The variables ‘reference-field-n’ or ‘data-field-n’ will be available to the scheme command, depending on whether the record to be filtered is from the data file or a reference file. In the variable names ‘n’ represents the number of the specified output field, numbered from 1.

mscheme-command

Validate a proposed match using the scheme command provided. The scheme command must return ‘#t’ (to confirm that this is a good match) or ‘#f’ (to tell combine that this is not a match). The variables ‘reference-field-n’ and ‘data-field-n’ will be available to the scheme command from the reference and data records involved in a particular match. In the variable names ‘n’ represents the number of the specified output field, numbered from 1. The extension specification affects the match between the data file and the last named reference file.

hscheme-command

Validate a proposed match between two records in the same hierarchy using the scheme command provided. The scheme command must return ‘#t’ (to confirm that this is a good match) or ‘#f’ (to tell combine that this is not a match). The variables ‘reference-field-n’ and ‘prior-reference-field-n’ will be available to the scheme command from the current and prior reference records involved in a particular match. In the variable names ‘n’ represents the number of the specified output field, numbered from 1. The extension specification affects the match while traversing the hierarchy in the last named reference file.

rscheme-command

Modify a record that has just been read using the scheme command provided. The scheme command must return a string, which will become the new value of the input record to be processed. The input record itself can be referred to by using the variable ‘input-record’ at the right place in the scheme command. The records affected by this option are the records from the most recently named reference file, or from the data file if no reference file has yet been named.

As an example, consider that you may have received a file from someone who strips all the trailing spaces from the end of a record, but you need to treat it with a fixed-width record layout. Assuming that you have defined a scheme function rpad in the initialization file ‘util.scm’, you can use the following command to get at the field in positions 200-219, with spaces in place of the missing rest of the record.

 
combine -X util.scm -x 'r(rpad input-record 219 #\space)' \
            -o 200-219 trimmed_file.txt

The same syntax works with the other ‘--extension’ options.
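
The rpad function used in the example above is not shown in this section. A minimal sketch of what such a function might look like follows. It is an assumption made for illustration, not necessarily the definition shipped in ‘util.scm’.

```scheme
;; Hypothetical rpad: pad STR on the right with PAD-CHAR until it is at
;; least WIDTH characters long.  Records already WIDTH characters or
;; longer are returned unchanged, so no data is ever truncated.
(define (rpad str width pad-char)
  (let ((len (string-length str)))
    (if (>= len width)
        str
        (string-append str (make-string (- width len) pad-char)))))

;; (rpad "abc" 5 #\space)  returns "abc  "
```

Guile's own string-pad-right also pads on the right, but it truncates strings longer than the requested width, which is why this sketch is written out by hand.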



4.2 Guile Modules

Here we talk about Guile modules that are distributed with combine. At the moment, those are limited to date processing.

In addition, the file ‘util.scm’ in the distribution contains a few functions I have found handy. They are not documented here, and the file doesn’t get installed automatically.



4.2.1 Calendar Functions

Included in the combine package are two Guile modules to work with dates from a number of calendars, both obscure and common. The basis for them is the set of calendar functions that are shipped with Emacs.

The reason that these functions deserve special notice here is that date comparisons are a common type of comparison that often cannot be made directly on a character string. For example I might have trouble knowing if "20030922" is the same date as "22 September 2003" if I compared strings; however, comparing them as dates allows me to find a match. We can even compare between calendars, ensuring that "1 Tishri 5764" is recognized as the same date as "20030927".

The calendar module can be invoked as (use-modules (combine_scm calendar)). It provides functions for converting between a variety of calendars and an absolute date count, whose 0-day is the imaginary date 31 December 1 B.C. In the functions, the absolute date is treated as a single number, and the dates are lists of numbers in (month day year) format unless otherwise specified.
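
To make the absolute date count concrete, here is a self-contained sketch of how a Gregorian (month day year) list maps to a day count under the definition above. The names leap-year?, days-before-month, sketch-absolute-from-gregorian, and sketch-date<? are hypothetical and do not come from the calendar module; the sketch only handles years from 1 onward.

```scheme
;; Gregorian leap-year rule.
(define (leap-year? year)
  (and (zero? (modulo year 4))
       (or (not (zero? (modulo year 100)))
           (zero? (modulo year 400)))))

;; Days elapsed in YEAR before the first day of MONTH.
(define (days-before-month month year)
  (let ((cumulative '(0 31 59 90 120 151 181 212 243 273 304 334)))
    (+ (list-ref cumulative (- month 1))
       (if (and (> month 2) (leap-year? year)) 1 0))))

;; Day count with 31 December 1 B.C. as day 0, so 1 January 1 is day 1.
(define (sketch-absolute-from-gregorian date)
  (let* ((month (car date))
         (day (cadr date))
         (year (caddr date))
         (prior (- year 1)))          ; complete years before this one
    (+ day
       (days-before-month month year)
       (* 365 prior)
       (quotient prior 4)
       (- (quotient prior 100))
       (quotient prior 400))))

;; Comparing day counts orders dates correctly, which comparing the
;; (month day year) lists element by element would not.
(define (sketch-date<? a b)
  (< (sketch-absolute-from-gregorian a)
     (sketch-absolute-from-gregorian b)))

;; (sketch-absolute-from-gregorian '(9 27 2003))  returns 731485
```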

The calendar functions are as follows:



4.2.2 Calendar Reference

Here are some variables that can be used as references to get the names associated with the numbers that the date conversion functions produce for months, days, and other calendar elements.

gregorian-day-name-alist

An associative list giving the weekdays in the Gregorian calendar in a variety of languages. Each element of this list is a list composed of a 2-letter language code (lowercase) and a list of 7 day names.

gregorian-month-name-alist

An associative list giving the months in the Gregorian calendar in a variety of languages. Each element of this list is a list composed of a 2-letter language code (lowercase) and a list of 12 month names.

calendar-islamic-month-name-array

A list of the months in the Islamic calendar.

calendar-hebrew-month-name-array-common-year

A list of the months in the standard Hebrew calendar.

calendar-hebrew-month-name-array-leap-year

A list of the months in the leap year Hebrew calendar.

chinese-calendar-celestial-stem
chinese-calendar-terrestrial-branch
lunar-phase-name-alist
solar-n-hemi-seasons-alist
solar-s-hemi-seasons-alist
french-calendar-month-name-array

A list of the months in the French Revolutionary calendar.

french-calendar-multibyte-month-name-array

A list of the months in the French Revolutionary calendar, using multibyte codes to represent the accented characters.

french-calendar-day-name-array

A list of the days in the French Revolutionary calendar.

french-calendar-multibyte-special-days-array

A list of the special days (non weekdays) in the French Revolutionary calendar, using multibyte codes to represent the accented characters.

french-calendar-special-days-array

A list of the special days (non weekdays) in the French Revolutionary calendar.

coptic-calendar-month-name-array

A list of the months in the Coptic calendar.

ethiopic-calendar-month-name-array

A list of the months in the Ethiopic calendar.

persian-calendar-month-name-array

A list of the months in the Persian calendar.

calendar-mayan-haab-month-name-array
calendar-mayan-tzolkin-names-array
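
The language-keyed lists described above (a language code paired with a list of names) can be searched with assoc. The sketch below uses a small stand-in list so that it runs without the calendar module. The names sample-month-name-alist and month-name are hypothetical, and the sketch assumes the language codes are stored as strings.

```scheme
;; Stand-in data with the documented shape:
;; each element is (language-code (name-1 ... name-12)).
(define sample-month-name-alist
  '(("en" ("January" "February" "March" "April" "May" "June"
           "July" "August" "September" "October" "November" "December"))
    ("es" ("enero" "febrero" "marzo" "abril" "mayo" "junio"
           "julio" "agosto" "septiembre" "octubre" "noviembre" "diciembre"))))

;; Return the name of MONTH (1-12) in LANGUAGE, or #f if the language
;; is not present in ALIST.
(define (month-name alist language month)
  (let ((entry (assoc language alist)))
    (and entry
         (list-ref (cadr entry) (- month 1)))))

;; (month-name sample-month-name-alist "es" 9)  returns "septiembre"
```

With the module loaded, the same lookup would presumably work with gregorian-month-name-alist in place of the stand-in list, provided the language codes are stored in the same form.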


4.2.3 Calendar Parsing

The calendar parsing module can be invoked as (use-modules (combine_scm parse)).

The most useful function in the module is parse-date. It takes as arguments a date string and an output format. The date string is parsed as well as possible; in case of ambiguity, the candidate formats are tried in descending order of preference. The function returns the date triplet (or other such representation) suggested by the format string.

The supported format strings are the words that take the place of the xxxx in the function names of the form calendar-xxxx-from-absolute. See section Calendar Functions, for more information.

The parsing of the date string depends on the setting of a couple of variables. Look inside the file ‘parse.scm’ for details. The list parse-date-expected-order lists the order in which the parser should look for the year, month, and day in case of ambiguity. The list parse-date-method-preference gives more general format preferences, such as 8-digit, delimited, or a word for the month, and the expected incoming calendar.

Here are a few examples of parsing a date and putting it out in some formats:

 
guile> (use-modules (combine_scm parse))
guile> (parse-date "27 September 2003" "gregorian")
(9 27 2003)
guile> (parse-date "27 September 2003" "julian")
(9 14 2003)

The 13-day difference between the calendars is the reason that the Orthodox Christmas falls nearly 2 weeks after the Roman Catholic Christmas.

 
guile> (parse-date "27 September 2003" "hebrew")
(7 1 5764)

Note that the Hebrew date is Rosh HaShannah, the first day of the year 5764. The reason that the month is listed as 7 rather than 1 is inherited from the Emacs calendar implementation. Using the month list in calendar-hebrew-month-name-array-common-year or calendar-hebrew-month-name-array-leap-year correctly gives "Tishri", but since the extra month (in years that have it) comes mid-year, the programming choice that I carried forward was to cycle the months around so that the extra month would come at the end of the list.
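
The month numbering can be checked with a short lookup. The list below is a stand-in that follows the ordering described above, with the year's first month, Tishri, in seventh position. The spellings and the names sample-hebrew-months-common-year and hebrew-month-name are assumptions for illustration, and the module's own list may differ in detail.

```scheme
;; Stand-in for the common-year month list, cycled so that an extra
;; leap-year month would be appended at the end.
(define sample-hebrew-months-common-year
  '("Nisan" "Iyar" "Sivan" "Tammuz" "Av" "Elul"
    "Tishri" "Heshvan" "Kislev" "Teveth" "Shevat" "Adar"))

;; Look up the month name for a (month day year) list.
(define (hebrew-month-name date names)
  (list-ref names (- (car date) 1)))

;; (hebrew-month-name '(7 1 5764) sample-hebrew-months-common-year)
;; returns "Tishri"
```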

 
guile> (parse-date "27 September 2003" "islamic")
(7 30 1424)
guile> (parse-date "27 September 2003" "iso")
(39 6 2003)

This is the 6th day (Saturday) of week 39 of the year.

 
guile> (parse-date "27 September 2003" "mayan-long-count")
(12 19 10 11 7)

I won’t get into the detail, but the five numbers reflect the date in the Mayan calendar as currently understood.

Generally, I’d recommend using the more specific functions if you are sure of the date format you expect. For comparing dates, I would further recommend comparing the absolute day count rather than any formatted representation.



This document was generated by Daniel P. Valentine on July 28, 2013 using texi2html 1.82.