combine Manual 0.4.0: 2.4 Reference Files

2.4 Reference Files

A reference file record is expected to match a data file record on a set of key fields. The parts of a reference file that are needed for processing are read entirely into memory. You can specify as many reference files as you want, limited only by the amount of memory your system can spare to hold them. For each reference file, you must specify at least a file name, the key fields in the reference file, and the matching key fields in the data file.

The following options relate to reference files. They are all positional: each one applies to the processing of the most recently named reference file. (The exception, of course, is the reference file name itself, which applies to itself.)

‘-r filename’

‘--reference-file=filename’

Use filename as a reference file to match to the data file in processing. This option introduces a block of positional options that relate to this reference file’s processing in combine.

‘-k range_string’

‘--key-fields=range_string’

Use the fields specified by range_string as a key to match to a corresponding key in the data file.

‘-m range_string’

‘--data-key-fields=range_string’

Use the data file fields specified by range_string as the key that corresponds to the key taken from this reference file.

‘-a range_string’

‘--hierarchy-key-fields=range_string’

Use the fields specified by range_string as a key to perform a recursive hierarchical match within the reference file. This key will be matched against values specified in the regular key on the reference file.

‘-u’

‘--unique’

Keep only one record for the reference file in memory for each distinct key. By default combine maintains all the records from the reference file in memory for processing. This default allows for cartesian products when a key exists multiple times in both the reference and data files.
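The effect of keeping every reference record versus one per key can be illustrated with a small sketch. This is not combine's implementation: the helper names, the sample records, and the two-character key are hypothetical, and which record survives under ‘--unique’ is not stated above, so keeping the first one seen is an assumption.

```python
from collections import defaultdict

def load_reference(records, key_of, unique=False):
    """Index reference records by key; keep all of them, or only one per key."""
    table = defaultdict(list)
    for rec in records:
        bucket = table[key_of(rec)]
        if unique and bucket:
            continue  # assumption: the first record seen for a key is the one kept
        bucket.append(rec)
    return table

def match(data_records, table, key_of):
    """Pair each data record with every stored reference record sharing its key."""
    return [(d, r) for d in data_records for r in table.get(key_of(d), [])]

reference = ["A1 red", "A1 blue", "B2 green"]
data = ["A1 first", "A1 second", "C3 third"]
key = lambda rec: rec[:2]  # hypothetical key: the first two characters

# Key A1 occurs twice on each side, so keeping all records gives 2 x 2 pairs.
all_pairs = match(data, load_reference(reference, key), key)
# With one record per key, each matching data record pairs exactly once.
unique_pairs = match(data, load_reference(reference, key, unique=True), key)
```

Here the default behavior yields four pairs for key A1 (a cartesian product), while the unique variant yields two.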

‘-h number’

‘--hash-size=number’

Use the number provided as a base size for allocating a hash table to store the records from this reference file. If this number is too small, combine will fail when it tries to store a record it has no room for. If it is only a little too small, processing becomes inefficient, because finding open space in the hash table gets difficult.
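Why a tight table is slow, and a full one fatal, can be seen in a minimal sketch of a fixed-size hash table. It assumes linear probing as the way of "searching for open space"; the text above does not say which strategy combine uses, and the class name and integer keys are hypothetical.

```python
class FixedHashTable:
    """Fixed-size table with linear probing (an assumed collision strategy)."""

    def __init__(self, size):
        self.slots = [None] * size
        self.probes = 0  # total slots examined across all inserts

    def insert(self, key_number, record):
        size = len(self.slots)
        for step in range(size):
            self.probes += 1
            index = (key_number + step) % size
            if self.slots[index] is None:
                self.slots[index] = (key_number, record)
                return index
        # Every slot was examined and none was free.
        raise MemoryError("no room left in the hash table")

keys = [0, 4, 8, 12]  # keys already reduced to numbers, for simplicity

tight = FixedHashTable(4)   # exactly as many slots as records
roomy = FixedHashTable(16)  # plenty of spare slots
for k in keys:
    tight.insert(k, f"record {k}")
    roomy.insert(k, f"record {k}")
```

With these keys the tight table examines 10 slots in total while the roomy one examines 4, and a fifth insert into the tight table raises an error.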

‘-H keyword’

‘--hash-movement=keyword’

Specify one of the keywords binary, number, beginning, or end to indicate how to turn the key into a number with the best variability and least overlap. A wise choice for this option can cut processing time significantly. The binary option is the default, and treats the last few bytes (8 on most computers) of the key string(s) as a big number. The number option converts the entire key to a number, assuming it is a numeric string. The other two take the least significant 3 bits from each of the first or last few (21 where a 64-bit integer is available) bytes in the key strings and turn them into a number.
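The four conversions described above might look roughly like the following sketch. The function names are hypothetical, and details the text does not specify (byte order, and the order in which the 3-bit groups are packed) are assumptions; combine's actual code may differ.

```python
def key_binary(key: bytes, width: int = 8) -> int:
    """Treat the last `width` bytes of the key as one big number."""
    return int.from_bytes(key[-width:], "big")  # byte order is an assumption

def key_number(key: bytes) -> int:
    """Interpret the whole key as a decimal numeric string."""
    return int(key.decode("ascii"))

def key_low_bits(key: bytes, from_end: bool, count: int = 21) -> int:
    """Pack the low 3 bits of each of the first or last `count` bytes."""
    chosen = key[-count:] if from_end else key[:count]
    value = 0
    for byte in chosen:
        # Packing order (earlier bytes in higher bits) is an assumption.
        value = (value << 3) | (byte & 0b111)
    return value

binary_value = key_binary(b"AB")                       # bytes 0x41 0x42
number_value = key_number(b"000123")                   # numeric string
beginning_value = key_low_bits(b"AB", from_end=False)  # low bits 001 and 010
end_value = key_low_bits(b"\xff" * 30, from_end=True)  # only the last 21 bytes
```

Note that 21 bytes at 3 bits each give 63 bits, which is why the result fits in a 64-bit integer.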

‘-w’

‘--write-output’

Signals the program that output records should be written for every record stored for this reference file. This will be either one record for every record in the reference file or one record for every distinct set of keys in the reference file, depending on the setting of the option ‘--unique’. The record written will include all specified output fields from the reference file record, any specified constant value for this reference file, and any flags, counters, or sums requested.

‘-t filename’

‘--output-file=filename’

If provided, write the output based on this reference file to filename. Otherwise the output will go to stdout. This option only makes sense if you plan to write output based on this reference file.

‘-o range_string’

‘--output-fields=range_string’

Write the fields specified by range_string as part of the record in any reference-file- or data-file-based output. The range specifications share a common format with all field specifications for combine.

‘-K string’

‘--output-constant=string’

Write string to the reference- or data-file-based output.

‘-U’

‘--up-hierarchy’

When traversing the hierarchy from a given reference-file record, use the values on that record in the ‘--hierarchy-key-fields’ fields to connect to the ‘--key-fields’ fields of other records from the reference file. For most purposes, the presence of the connection on the first record suggests a single parent in a standard hierarchy. The hierarchy traversal stops when the ‘--hierarchy-key-fields’ fields are empty.

If this option is not set, the ‘--key-fields’ fields are used to search for the same values in the ‘--hierarchy-key-fields’ fields of other records in the same file. This allows multiple children of an initial record, and suggests going down in the hierarchy. The hierarchy traversal stops when no further connection can be made. The traversal is depth-first.
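The two traversal directions can be sketched as follows. This is an illustration only: the records are hypothetical (key, hierarchy key) pairs, the functions return just the onward records and not the starting one, and the data is assumed to contain no cycles.

```python
# Each record is (key, hierarchy_key); here the hierarchy key names the parent.
records = [
    ("root", ""),
    ("a", "root"),
    ("b", "root"),
    ("a1", "a"),
    ("a2", "a"),
]

def walk_up(start_key, records):
    """Follow hierarchy-key values to matching keys until the field is empty."""
    by_key = {key: parent for key, parent in records}
    path = []
    parent = by_key[start_key]
    while parent:
        path.append(parent)
        parent = by_key.get(parent, "")  # stop if no record carries this key
    return path

def walk_down(start_key, records):
    """Depth-first search for records whose hierarchy key equals this key."""
    found = []
    for key, parent in records:
        if parent == start_key:
            found.append(key)
            found.extend(walk_down(key, records))  # descend before moving on
    return found

up_from_a1 = walk_up("a1", records)        # single chain of parents
down_from_root = walk_down("root", records)  # all descendants, depth-first
```

Going up from a1 visits a and then root; going down from root visits a, a1, a2, and then b, which shows the depth-first order.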

‘-l’

‘--hierarchy-leaf-only’

When traversing a hierarchy, treat only the endpoints as matching records. Nodes that have onward connections are ignored except for navigating to the leaf nodes.

‘-F number’

‘--flatten-hierarchy=number’

When traversing a hierarchy, act as with ‘--hierarchy-leaf-only’, but save information about the intervening nodes. Repeat the ‘--output-fields’ fields number times (leaving them blank if there are fewer levels), starting from the first reference record matched.
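A sketch of how leaf-only matching and flattening differ, under stated assumptions: the hierarchy is given as a hypothetical mapping from each key to its children, a single key stands in for the output fields, and a path with more than number levels is simply truncated (the text above does not say what combine does in that case).

```python
def paths_to_leaves(start_key, children):
    """Yield every path of keys from start_key down to a leaf, depth-first."""
    kids = children.get(start_key, [])
    if not kids:
        yield [start_key]
        return
    for kid in kids:
        for path in paths_to_leaves(kid, children):
            yield [start_key] + path

def leaf_only(start_key, children):
    """Keep only the endpoints; intermediate nodes are used for navigation."""
    return [path[-1] for path in paths_to_leaves(start_key, children)]

def flatten(start_key, children, number, blank=""):
    """One row per leaf, repeating the field `number` times along the path."""
    rows = []
    for path in paths_to_leaves(start_key, children):
        row = path[:number]  # assumption: deeper levels are dropped
        rows.append(row + [blank] * (number - len(row)))
    return rows

children = {"root": ["a", "b"], "a": ["a1", "a2"]}  # hypothetical hierarchy

leaves = leaf_only("root", children)
rows = flatten("root", children, 3)
```

Here the leaf-only result is a1, a2, and b, while flattening to three levels produces one row per leaf, with the last position blank for the shorter path root, b.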

This document was generated by Daniel P. Valentine on July 28, 2013 using texi2html 1.82.