11.2.4 Splitting a Large File into Pieces

The split utility splits large text files into smaller pieces. Its usage, which follows the POSIX standard for split, is:

split [-l count] [-a suffix-len] [file [outname]]
split -b N[k|m] [-a suffix-len] [file [outname]]

By default, the output files are named xaa, xab, and so on. Each file has 1,000 lines in it, with the likely exception of the last file.

The split program has evolved over time, and the current POSIX version is more complicated than the original Unix version. The options and what they do are as follows:

-a suffix-len

Use suffix-len characters for the suffix. For example, if suffix-len is four, the output files would range from xaaaa to xzzzz.

-b N[k|m]

Instead of each file containing a specified number of lines, each file should have (at most) N bytes. Supplying a trailing ‘k’ multiplies N by 1,024, yielding kilobytes. Supplying a trailing ‘m’ multiplies N by 1,048,576 (1,024 * 1,024) yielding megabytes. (This option is mutually exclusive with -l).

-l count

Each file should have at most count lines, instead of the default 1,000. (This option is mutually exclusive with -b).
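The arithmetic behind -b and -a can be checked with a short sketch. The helper names bytes_from_spec() and names_available() are hypothetical and are not part of split.awk; the sketch assumes the size argument is a run of digits optionally followed by ‘k’ or ‘m’, and that suffixes use only the 26 lowercase letters:

```awk
# Hypothetical helpers for illustration; not part of split.awk.

# Convert a -b argument such as "42", "42k", or "42m" to a byte count.
function bytes_from_spec(spec,    n, modifier)
{
    n = spec + 0                               # "42m" + 0 yields 42
    modifier = substr(spec, length(spec), 1)   # last character of the argument
    if (modifier == "k")
        n *= 1024
    else if (modifier == "m")
        n *= 1024 * 1024
    return n
}

# Number of distinct output names available for a given -a value,
# assuming each suffix position holds one of 26 lowercase letters.
function names_available(suffix_len)
{
    return 26 ^ suffix_len
}

BEGIN {
    print bytes_from_spec("42k")    # 43008
    print bytes_from_spec("42m")    # 44040192
    print names_available(2)        # 676
    print names_available(4)        # 456976
}
```

With the default suffix length of two and the default 1,000 lines per file, this works out to at most 676,000 input lines before the names run out.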

If supplied, file is the input file to read. Otherwise standard input is processed. If supplied, outname is the leading prefix to use for file names, instead of ‘x’.

In order to use the -b option, gawk should be invoked with its -b option (see Command-Line Options), or with the environment variable LC_ALL set to ‘C’, so that each input byte is treated as a separate character. (78)
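A small sketch of why byte mode matters: length() and substr() count characters, so in a multibyte locale they can disagree with the number of bytes. The string below is an illustrative assumption (the word “café” written with octal escapes for its UTF-8 bytes), and the result depends on how awk is run:

```awk
BEGIN {
    # "caf" followed by the two UTF-8 bytes that encode e-acute:
    # five bytes in total, but four characters in a UTF-8 locale.
    s = "caf\303\251"

    # With gawk -b or LC_ALL=C this is expected to report 5 (bytes);
    # in a UTF-8 locale gawk is expected to report 4 (characters).
    print length(s)
}
```

Because the byte-counting rule relies on length() and substr(), running it in character mode could place file boundaries at the wrong byte offsets.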

Here is an implementation of split in awk. It uses the getopt() function presented in Processing Command-Line Options.

The program begins with a standard descriptive comment and then a usage() function describing the options. The variable common keeps the function’s lines short so that they look nice on the page:

# split.awk --- do split in awk
#
# Requires getopt() library function.

function usage(     common)
{
    common = "[-a suffix-len] [file [outname]]"
    printf("usage: split [-l count]  %s\n", common) > "/dev/stderr"
    printf("       split [-b N[k|m]] %s\n", common) > "/dev/stderr"
    exit 1
}

Next, in a BEGIN rule we set the default values and parse the arguments. After that we initialize the data structures used to cycle the suffix from ‘aa…’ to ‘zz…’. Finally we set the name of the first output file:

BEGIN {
    # Set defaults:
    Suffix_length = 2
    Line_count = 1000
    Byte_count = 0
    Outfile = "x"

    parse_arguments()

    init_suffix_data()

    Output = (Outfile compute_suffix())
}

Parsing the arguments is straightforward. The program follows our convention (see Naming Library Function Global Variables) of having important global variables start with an uppercase letter:

function parse_arguments(   i, c, l, modifier)
{
    while ((c = getopt(ARGC, ARGV, "a:b:l:")) != -1) {
        if (c == "a")
            Suffix_length = Optarg + 0
        else if (c == "b") {
            Byte_count = Optarg + 0
            Line_count = 0

            l = length(Optarg)
            modifier = substr(Optarg, l, 1)
            if (modifier == "k")
                Byte_count *= 1024
            else if (modifier == "m")
                Byte_count *= 1024 * 1024
        } else if (c == "l") {
            Line_count = Optarg + 0
            Byte_count = 0
        } else
            usage()
    }

    # Clear out options
    for (i = 1; i < Optind; i++)
        ARGV[i] = ""

    # Check for filename
    if (ARGV[Optind]) {
        Optind++

        # Check for different prefix
        if (ARGV[Optind]) {
            Outfile = ARGV[Optind]
            ARGV[Optind] = ""

            if (++Optind < ARGC)
                usage()
        }
    }
}

Managing the file name suffix is interesting. Given a suffix of length three, say, the values go from ‘aaa’, ‘aab’, ‘aac’ and so on, all the way to ‘zzx’, ‘zzy’, and finally ‘zzz’. There are two important aspects to this:

The computation is handled by compute_suffix(). This function is called every time a new file is opened.

The flow here is messy, because we want to generate ‘zzzz’ (say), and use it, and only produce an error after all the file name suffixes have been used up. The logical steps are as follows:

  1. Generate the suffix, saving the value in result to return. To do this, the supplementary array Suffix_ind contains one element for each letter in the suffix. Each element ranges from 1 to 26, acting as the index into a string containing all the lowercase letters of the English alphabet. It is initialized by init_suffix_data(). result is built up one letter at a time, using one substr() call per letter.
  2. Prepare the data structures for the next time compute_suffix() is called. To do this, we loop over Suffix_ind, backwards. If the current element is less than 26, it’s incremented and the loop breaks (‘abq’ goes to ‘abr’). Otherwise, the element is reset to one and we move down the list (‘abz’ to ‘aca’). Thus, the Suffix_ind array is always “one step ahead” of the actual file name suffix to be returned.
  3. Check if we’ve gone past the limit of possible file names. If Reached_last is true, print a message and exit. Otherwise, check if Suffix_ind describes a suffix where all the letters are ‘z’. If so, we’re about to return the final suffix, and we set Reached_last to true so that the next call to compute_suffix() will cause a failure.

Physically, the steps in the function occur in the order 3, 1, 2:

function compute_suffix(    i, result, letters)
{
    # Logical step 3
    if (Reached_last) {
        printf("split: too many files!\n") > "/dev/stderr"
        exit 1
    } else if (on_last_file())
        Reached_last = 1    # fail when wrapping after 'zzz'

    # Logical step 1
    result = ""
    letters = "abcdefghijklmnopqrstuvwxyz"
    for (i = 1; i <= Suffix_length; i++)
        result = result substr(letters, Suffix_ind[i], 1)

    # Logical step 2
    for (i = Suffix_length; i >= 1; i--) {
        if (++Suffix_ind[i] > 26) {
            Suffix_ind[i] = 1
        } else
            break
    }

    return result
}
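As a cross-check on the sequence that compute_suffix() produces, the same suffixes can be derived arithmetically by treating the suffix as a base-26 number. This is a different technique from the incremental one used in split.awk, and the helper nth_suffix() is hypothetical; it assumes output files are numbered from zero:

```awk
# Hypothetical helper, not part of split.awk: return the suffix of the
# n-th output file (counting from 0), treating the suffix as a base-26
# number whose digits are the letters a through z.
function nth_suffix(n, len,    letters, result, i)
{
    letters = "abcdefghijklmnopqrstuvwxyz"
    result = ""
    for (i = 1; i <= len; i++) {
        # Peel off the least significant letter and prepend it
        result = substr(letters, (n % 26) + 1, 1) result
        n = int(n / 26)
    }
    return result
}

BEGIN {
    print nth_suffix(0, 2)      # aa
    print nth_suffix(25, 2)     # az
    print nth_suffix(26, 2)     # ba
    print nth_suffix(675, 2)    # zz
}
```

The incremental approach in compute_suffix() avoids keeping a file counter and makes it easy to detect the final ‘zz…’ suffix, which may be why the program maintains Suffix_ind instead.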

The Suffix_ind array and Reached_last are initialized by init_suffix_data():

function init_suffix_data(  i)
{
    for (i = 1; i <= Suffix_length; i++)
        Suffix_ind[i] = 1

    Reached_last = 0
}

The function on_last_file() returns true if Suffix_ind describes a suffix where all the letters are ‘z’ by checking that all the elements in the array are equal to 26:

function on_last_file(  i, on_last)
{
    on_last = 1
    for (i = 1; i <= Suffix_length; i++) {
        on_last = on_last && (Suffix_ind[i] == 26)
    }

    return on_last
}

The actual work of splitting the input file is done by the next two rules. Since splitting by line count and splitting by byte count are mutually exclusive, we simply use two separate rules, one for when Line_count is greater than zero, and another for when Byte_count is greater than zero.

The variable tcount counts how many lines have been processed so far. When it exceeds Line_count, it’s time to close the previous file and switch to a new one:

Line_count > 0 {
    if (++tcount > Line_count) {
        close(Output)
        Output = (Outfile compute_suffix())
        tcount = 1
    }
    print > Output
}
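The effect of the line-count rule can be summarized by a closed-form mapping from input line number to output file. The helper file_index() below is a hypothetical illustration, not part of split.awk, and it numbers output files from zero:

```awk
# Hypothetical helper: 0-based index of the output file that receives
# input line number nr when each file holds at most line_count lines.
function file_index(nr, line_count)
{
    return int((nr - 1) / line_count)
}

BEGIN {
    print file_index(1, 1000)       # 0: first line goes to the first file
    print file_index(1000, 1000)    # 0: line 1000 still fits in the first file
    print file_index(1001, 1000)    # 1: line 1001 starts the second file
    print file_index(2500, 1000)    # 2
}
```

This matches the rule's behavior: the switch to a new file happens when the counter would exceed Line_count, so line 1,001 is the first line written to the second file.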

The rule for handling bytes is more complicated. Since lines most likely vary in length, the Byte_count boundary may be hit in the middle of an input record. In that case, split has to write enough of the first bytes of the input record to finish up Byte_count bytes, close the file, open a new file, and write the rest of the record to the new file. The logic here does all that:

Byte_count > 0 {
    # `+ 1' is for the final newline
    if (tcount + length($0) + 1 > Byte_count) { # would overflow
        # compute leading bytes
        leading_bytes = Byte_count - tcount

        # write leading bytes
        printf("%s", substr($0, 1, leading_bytes)) > Output

        # close old file, open new file
        close(Output)
        Output = (Outfile compute_suffix())

        # set up first bytes for new file
        $0 = substr($0, leading_bytes + 1)  # trailing bytes
        tcount = 0
    }

    # write full record or trailing bytes
    tcount += length($0) + 1
    print > Output
}
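To trace the boundary arithmetic for a single record, here is a sketch that repeats the rule's calculations without writing any files. The function split_record() and the array parts are hypothetical names introduced for illustration; the sketch assumes byte mode, so that length() counts bytes:

```awk
# Hypothetical helper mirroring the byte rule's arithmetic for one record.
# Stores the piece for the current file in parts["leading"] and the piece
# for the next write in parts["trailing"]; returns the updated byte count.
function split_record(record, tcount, byte_count, parts,    leading_bytes)
{
    parts["leading"] = ""
    parts["trailing"] = record
    # `+ 1' is for the final newline
    if (tcount + length(record) + 1 > byte_count) {    # would overflow
        leading_bytes = byte_count - tcount
        parts["leading"] = substr(record, 1, leading_bytes)
        parts["trailing"] = substr(record, leading_bytes + 1)
        tcount = 0                                     # new file starts empty
    }
    return tcount + length(parts["trailing"]) + 1
}

BEGIN {
    split("", parts)    # make parts an empty array

    # 6 bytes already written with a limit of 10:
    # the 8-byte record is cut after its first 4 bytes.
    n = split_record("abcdefgh", 6, 10, parts)
    print parts["leading"]      # abcd
    print parts["trailing"]     # efgh
    print n                     # 5
}
```

Here the old file receives ‘abcd’ to reach exactly 10 bytes, and the new file starts with ‘efgh’ plus a newline, for a running count of 5.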

Finally, the END rule cleans up by closing the last output file:

END {
    close(Output)
}

Footnotes

(78)

Using -b twice requires separating gawk’s options from those of the program. For example: ‘gawk -f getopt.awk -f split.awk -b -- -b 42m large-file.txt split-’.