combine Manual 0.4.0: 2.5 Output Files

2.5 Output Files

There are two basic kinds of output files: one based on the data records and reference records that match them, the other based on a full set of records from one reference file with flags, counts, or sums based on the aggregate of the matching data records.

The output file based on the data file consists of information from the data file records and any matching reference file records. The records that go into data-based output files can be figured out as follows:

no reference file: If there is no reference file, there will be one record for every record in the data file, with the exception of any records that were eliminated through an extension filter. (See section Extending combine.)
reference files with ‘--unique’ and ‘--match-optional’ options: If all reference files are specified with the ‘--unique’ and ‘--match-optional’ options, then the records selected for insertion into the data-based output file will be the same as those that would be selected without a reference file.
reference files without the ‘--unique’ option: If a reference file is not given the ‘--unique’ option and there is more than one reference record that matches a given data record, then the data record will be represented more than once in the output file, each time combined with information from a different matching reference record. If there is more than one reference file where this is the case, the result will be multiplicative (e.g. 2 matches in each of 2 reference files will produce 4 records). This is the default setting for a reference file.
reference files without the ‘--match-optional’ option: If a reference file is not given the ‘--match-optional’ option, then any data record that does not have a match in the reference file will not be represented in the output file. This is the default setting.
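
The selection rules above can be modeled in a few lines of Python. This is only an illustrative sketch, not combine's implementation: the function name select_records and the dictionary keys 'records', 'unique', and 'match_optional' are hypothetical, and it assumes that ‘--unique’ keeps the first record seen for each key, which this section does not state.

```python
from itertools import product

def select_records(data, refs):
    """Model which data-based output records are produced.

    data: list of (key, payload) tuples from the data file.
    refs: list of dicts with hypothetical keys 'records' (list of
          (key, payload)), 'unique' (bool) and 'match_optional' (bool).
    Returns a list of (data_payload, tuple_of_reference_payloads).
    """
    out = []
    for key, payload in data:
        per_ref = []
        dropped = False
        for ref in refs:
            matches = [r for k, r in ref["records"] if k == key]
            if ref["unique"]:
                # Assumption: keep the first record for each key.
                matches = matches[:1]
            if not matches:
                if ref["match_optional"]:
                    matches = [None]  # no match, but the data record is kept
                else:
                    dropped = True    # a required match is missing
                    break
            per_ref.append(matches)
        if not dropped:
            # One output record per combination of matching reference
            # records, so multiple matches multiply across files.
            for combo in product(*per_ref):
                out.append((payload, combo))
    return out

data = [("a", "A1"), ("b", "B1")]
ref1 = {"records": [("a", "x1"), ("a", "x2")], "unique": False, "match_optional": False}
ref2 = {"records": [("a", "y1"), ("a", "y2")], "unique": False, "match_optional": False}
print(len(select_records(data, [ref1, ref2])))
```

With two matches in each of the two reference files, key "a" yields four records, and key "b" is dropped because neither file is match-optional.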

The fields that can appear in data-file-based output can come from the data-file record and any matching reference file records.

Reference-file-based output files are simpler. With the ‘--unique’ option, the file has one entry for each unique key in the reference file; without it, the file has one entry for each record in the reference file.

The fields in the reference-file-based output come exclusively from the reference file, except for optional fields that can be summarized from fields on matching data-file records.

The order of the fields in an output record can either follow the default or be specified explicitly by the user.

In data-file-based output, the standard field order is as follows. All the fields listed are printed in this order if they are specified. If an output field delimiter is specified, it is put between every pair of adjacent fields. If there is no match for a given reference file (and the ‘--match-optional’ option is set for the file), all the fields that would normally be provided by that file are filled with spaces in fixed-width output or left zero-length in delimited output.

All the data-file output fields (in order)
The constant string set for the data file
For each reference file:
    The constant string set for the reference file
    All the reference-file output fields
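
As a rough illustration of this default order, the following Python sketch assembles one data-file-based output record. The function format_data_record and its argument layout are hypothetical. It also assumes that, for an unmatched reference file, only the output fields are blanked while the constant string is still written; the text above does not say how the constant is treated in that case.

```python
def format_data_record(data_fields, data_constant, refs, delimiter=None):
    """Assemble one output record in the default field order.

    data_fields:   list of strings taken from the data-file record.
    data_constant: constant string for the data file, or None.
    refs:          list of dicts with hypothetical keys 'constant'
                   (str or None), 'fields' (list of str, or None when
                   the file had no match) and 'widths' (list of int).
    delimiter:     output field delimiter, or None for fixed-width.
    """
    parts = list(data_fields)
    if data_constant is not None:
        parts.append(data_constant)
    for ref in refs:
        # Assumption: the constant is emitted even without a match.
        if ref["constant"] is not None:
            parts.append(ref["constant"])
        if ref["fields"] is not None:
            parts.extend(ref["fields"])
        elif delimiter is not None:
            # Delimited output: unmatched fields are zero-length.
            parts.extend("" for _ in ref["widths"])
        else:
            # Fixed-width output: unmatched fields are space-filled.
            parts.extend(" " * w for w in ref["widths"])
    return (delimiter or "").join(parts)

unmatched = [{"constant": None, "fields": None, "widths": [3, 2]}]
print(repr(format_data_record(["ab", "cd"], "K", unmatched, delimiter=",")))
print(repr(format_data_record(["ab", "cd"], "K", unmatched)))
```

The first call shows the delimited case, where the two unmatched fields appear as empty strings between delimiters; the second shows the fixed-width case, where they are padded to their widths.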

In reference-file-based output, the standard field order is as follows. All the fields listed are printed in this order if they are specified. If an output field delimiter is specified, it is put between every pair of adjacent fields.

All the reference-file output fields OR the key fields if no output fields are given
A 1/0 flag indicating whether there was any match
A counter of the number of data records matched
A sum of each of the data-file sum fields from each matching data-file record
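
The flag, counter, and sums can be pictured with a small Python model. This is a sketch under stated assumptions rather than combine's own logic: the function reference_output and its arguments are hypothetical, keys are compared as plain strings, and with unique=True the first record for each key is the one kept.

```python
def reference_output(ref_records, data_records, n_sums, unique=False):
    """Model reference-file-based output rows.

    ref_records:  list of (key, list_of_output_fields).
    data_records: list of (key, list_of_numeric_sum_fields).
    n_sums:       number of sum fields on each data record.
    Returns rows of [output fields..., flag, counter, sums...].
    """
    seen = set()
    rows = []
    for key, fields in ref_records:
        if unique:
            # One entry per unique key (assumption: first one wins).
            if key in seen:
                continue
            seen.add(key)
        matches = [sums for k, sums in data_records if k == key]
        flag = 1 if matches else 0   # 1/0: was there any match?
        count = len(matches)         # number of data records matched
        totals = [sum(m[i] for m in matches) for i in range(n_sums)]
        rows.append(list(fields) + [flag, count] + totals)
    return rows

refs = [("a", ["Alpha"]), ("b", ["Beta"]), ("a", ["Alpha2"])]
data = [("a", [10]), ("a", [5])]
for row in reference_output(refs, data, n_sums=1):
    print(row)
```

Every reference record appears in the result whether or not it matched; unmatched records simply carry a flag of 0, a counter of 0, and sums of 0.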

The order of the fields in any output file can be customized using the ‘--field-order’ (or ‘-O’) option. The argument for the option is a comma-separated list of field identifiers. Each field identifier has 2 parts, a source and a type, separated by a period (.).

The sources are composed of an ‘r’ for reference file or ‘d’ for data file followed by an optional number. The number indicates which reference file the field comes from and is ignored for data files. Without a number, the first of each is taken.

A third source ‘s’ represents a substitution in the event that the preceding reference file field could not be provided because there was no match between that reference file and the data file. The number following it, if blank or zero, tells combine to take the field from the data file. Any other number means the corresponding reference file. This allows the conditional update of fields from the data file, or a prioritization of selections from a variety of reference files. If you are working with fixed-width fields, you should ensure that the lengths of the various fields in the substitution chain are the same.

The types are composed similarly. The identifiers are listed below. The number is ignored for identifiers of string constants, flags, and counters. For output fields, a hyphen-separated range of fields can be written to avoid having to write a long list. Any number provided is the number of the field in the order it was specified in the ‘-o’ or ‘-s’ option on the command line. In delimited-field files this may differ from the field number used in those options.

‘o’: Output fields from either reference or data files.
‘k’: String constant for either reference or data files.
‘f’: Flag (1/0) for reference files.
‘n’: Counter for reference files.
‘s’: Sum field for reference files.

Here is an example:

--field-order d.o1,d.o2,d.o3,d.k,r1.o1,s2.o1,s0.o4

--field-order d.o1-3,d.k,r1.o1,s2.o1,s0.o4

In this case, the first three fields from the data file are followed by the constant string from the data file. Then, if there was a match to reference file 1, the first field from that file is taken, otherwise if there was a match to reference file 2, the first field from that file is taken. If neither file matched the data record, the fourth field from the data record is taken instead.

The second line is equivalent, using a range of fields for convenience.
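
To make the identifier syntax and the substitution chain concrete, here is a small Python model. It is a sketch only: parse_field_order and resolve are hypothetical names, resolve handles just the ‘o’ and ‘k’ types used in data-file-based output, and an exhausted chain is written as an empty string, as it would be in delimited output.

```python
import re

# source letter, optional file number, '.', type letter,
# optional field number, optional '-last' for a range
_ID = re.compile(r"^([rds])(\d*)\.([okfns])(\d*)(?:-(\d+))?$")

def parse_field_order(spec):
    """Expand a --field-order argument into (source, file, type, field) tuples."""
    order = []
    for item in spec.split(","):
        m = _ID.match(item.strip())
        if m is None:
            raise ValueError("bad field identifier: %r" % item)
        src, src_num, typ, first, last = m.groups()
        if src == "d":
            file_no = 0                  # the number is ignored for data files
        elif src == "s":
            file_no = int(src_num or 0)  # blank or zero means the data file
        else:
            file_no = int(src_num or 1)  # no number means the first reference file
        if not first:
            order.append((src, file_no, typ, None))
            continue
        for n in range(int(first), int(last or first) + 1):
            order.append((src, file_no, typ, n))
    return order

def resolve(order, data, refs):
    """Build the output field list for one data record.

    data: {'o': [fields...], 'k': constant}
    refs: {file_no: same layout, or None when that file had no match}
    """
    out, pending = [], False
    for src, file_no, typ, n in order:
        if src == "s":
            if not pending:
                continue          # preceding field was provided; skip substitute
        elif pending:
            out.append("")        # chain exhausted without a substitute
            pending = False
        record = data if file_no == 0 else refs.get(file_no)
        if record is None:
            pending = True        # no match: try the next substitute, if any
            continue
        out.append(record["k"] if typ == "k" else record["o"][n - 1])
        pending = False
    if pending:
        out.append("")
    return out

order = parse_field_order("d.o1-3,d.k,r1.o1,s2.o1,s0.o4")
data = {"o": ["d1", "d2", "d3", "d4"], "k": "K"}
print(resolve(order, data, {1: None, 2: {"o": ["r2a"], "k": ""}}))
```

In the call shown, reference file 1 has no match, so the first field of reference file 2 is used in its place; if both were unmatched, the fourth data field would be used instead.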

This document was generated by Daniel P. Valentine on July 28, 2013 using texi2html 1.82.