[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

1.1 Introduction

combine is primarily a program for merging files on a common key. While such things as table joins on keys are common in database systems like MySQL, there seems to be a marked failure in the availability for such processing on text files. combine is intended to fill that gap.

Another way of looking at combine is as the kernel of a database system without all the overhead (and safeguards) associated with such systems. The missing baggage that appeals most to me is the requirement to load data into somebody else’s format before working with it. combine’s design is intended to allow it to work with most data directly from the place where it already is.

In looking around for existing software that wanted to do what I wanted to do, the closest I came was the join utility in GNU and other operating systems. join has some limitations that I needed to overcome. In particular, in matching it works on only one field each in only two files. For such a match, on files whose fields are separated by a delimiter, I’m sure that join is a more efficient choice. Someday I’ll test that assumption.

Once I started writing the program, I had to come up with a name. Given that one of the earliest applications that I imagined for such a program would be to prepare data for storage in a data warehouse, I thought to where the things that are stored in physical warehouses come from. At that point, I came up with the name DataFactory. Unfortunately, just as I got ready to release it, I noticed that someone else has that as a registered trademark.

As a result, I have come up with the name combine. Like the farm implement of the same name, this program can be used to separate the wheat from the chaff. I also like it because it has a similarity to join reminiscent of the similarity of function.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

1.1.1 Ways to Use combine

Here are some almost real-world applications of combine that may whet your imagination for your own uses.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

1.1.1.1 Encrypted Data

Suppose you have a file that you would normally want to keep encrypted (say, your customer information, which includes their credit card numbers). Suppose further that you occasionally receive updates of some less sensitive information about the customers.

To update the sensitive file, you could unencrypt the file, load all the data from both files into a database, merge the data, extract the result from the database (making sure to wipe the disk storage that was used), and re-encrypt. That leaves the information open to DBAs and the like during the process, and it sure seems like a lot of steps.

Using combine and your favorite encryption program, the data never has to touch the disk unencrypted (except in a swap file) so casual observers are unlikely to run across sensitive information. In addition an updated encrypted file can be created in one command line, piping the input and output of the sensitive data through your encryption program.

Here’s a sample command that might do the trick.

 
gpg -d < secret_file \
| combine -w -o 1-300 \
              -r update_information -k 1-10 -m 1-10 -o 11-20 -p \
| gpg -e > updated_secret_file

[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

1.1.1.2 Data Validation

Another example that is often important when receiving data from the outside world is data validation.

Perhaps you receive product sales data from a number of retailers, which you need to combine with product information you have elsewhere. To confirm that the file you received is valid for reporting, you might need to check the product codes, ZIP codes, and retailer codes against your known valid values.

All the comparisons can be done in a single command, which will result in a status which will flag us down if anything did not match to our lists of expected values.

 
combine -w -o 1- \
            -r products.txt -k 1-18 -m 11-28 \
            -r zip.txt -k 1-5 -m 19-23 \
            -r customer.txt -k 1-10 -m 1-10 \
            input.txt \
  | cmp input.txt
result=$?

That’s probably enough if we are pretty sure that the incoming data is usually clean. If something does not match up, we can investigate by hand. On the other hand, if we expect to find differences a little more frequently, we can make some small changes.

The following command makes the match optional, but puts a constant ’$’ at the end of the record for each match among the three keys to be validated. When there isn’t a match, combine puts a space into the record in place of the ’$’. We can then search for something other than ’$$$’ at the end of the record to know which records didn’t match.

 
combine -w -o 1- \
            -r products.txt -k 1-18 -m 11-28 -k '$' -p \
            -r zip.txt -k 1-5 -m 19-23 -k '$' -p \
            -r customer.txt -k 1-10 -m 1-10 -k '$' -p \
            input.txt \
  | grep -v '\$\$\$$' > input.nomatch

[ < ] [ > ]   [ << ] [ Up ] [ >> ]

This document was generated by Daniel P. Valentine on July 28, 2013 using texi2html 1.82.