

11.2.7.3 Code for wc.awk

The usage for wc is as follows:

wc [-lwcm] [files …]

If no files are specified on the command line, wc reads its standard input. If there are multiple files, it also prints total counts for all the files. The options and their meanings are as follows:

-c

Count only bytes. Once upon a time, the ‘c’ in this option stood for “characters.” But, as explained earlier, bytes and characters are no longer synonymous.

-l

Count only lines.

-m

Count only characters.

-w

Count only words. A “word” is a contiguous sequence of nonwhitespace characters, separated by spaces and/or TABs. Luckily, this is the normal way awk separates fields in its input data.

Implementing wc in awk is particularly elegant, because awk does a lot of the work for us: it splits lines into words (i.e., fields) and counts them, it counts lines (i.e., records), and it can easily tell us how long a line is in characters.
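As a small illustration of how much awk does on its own, the following sketch counts the words in a string with split(), which applies the same default whitespace splitting that produces NF. The function name count_words() and the sample string are hypothetical and are not part of wc.awk:

```awk
# count_words --- hypothetical helper, not part of wc.awk
#
# Return the number of whitespace-separated words in str.
# With no third argument, split() uses the default field
# separator, so leading and trailing blanks are ignored and
# runs of spaces and TABs count as a single separator.

function count_words(str,    parts)
{
    return split(str, parts)
}

BEGIN {
    line = "  the quick\tbrown   fox "
    # length() reports the size of the string in characters
    printf "words=%d chars=%d\n", count_words(line), length(line)
}
```

For ordinary input records, awk performs this splitting automatically, which is why wc.awk itself only needs to add NF to its running count.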

This program uses the getopt() library function (see Processing Command-Line Options) and the file-transition functions (see Noting Data file Boundaries).

This version has one notable difference from older versions of wc: it always prints the counts in the order lines, words, characters and bytes. Older versions note the order of the -l, -w, and -c options on the command line, and print the counts in that order. POSIX does not mandate this behavior, though.

The BEGIN rule does the argument processing. The variable print_total is true if more than one file is named on the command line:

# wc.awk --- count lines, words, characters, bytes

# Options:
#    -l    only count lines
#    -w    only count words
#    -c    only count bytes
#    -m    only count characters
#
# Default is to count lines, words, bytes
#
# Requires getopt() and file transition library functions
# Requires mbs extension from gawkextlib

@load "mbs"

BEGIN {
    # let getopt() print a message about
    # invalid options. we ignore them
    while ((c = getopt(ARGC, ARGV, "lwcm")) != -1) {
        if (c == "l")
            do_lines = 1
        else if (c == "w")
            do_words = 1
        else if (c == "c")
            do_bytes = 1
        else if (c == "m")
            do_chars = 1
    }
    for (i = 1; i < Optind; i++)
        ARGV[i] = ""

    # if no options, do lines, words, bytes
    if (! do_lines && ! do_words && ! do_chars && ! do_bytes)
        do_lines = do_words = do_bytes = 1

    print_total = (ARGC - i > 1)
}
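The expression that sets print_total can be checked with a little arithmetic. When the option loop finishes, i equals Optind, the index in ARGV of the first nonoption argument, so ARGC - i is the number of file names that remain. The sketch below mirrors that computation in a hypothetical function, need_total(), with the values of ARGC and Optind written out by hand for three assumed command lines:

```awk
# need_total --- hypothetical helper mirroring how wc.awk
# computes print_total; it is not part of wc.awk itself

function need_total(argc, optind)
{
    # argc - optind is the number of file operands
    return (argc - optind > 1)
}

BEGIN {
    # wc -l a.txt b.txt: ARGV is "awk", "-l", "a.txt", "b.txt"
    print need_total(4, 2)    # prints 1: two files, so print a total

    # wc -l a.txt: only one file operand
    print need_total(3, 2)    # prints 0

    # wc -l: no files, so standard input is read
    print need_total(2, 2)    # prints 0
}
```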

The beginfile() function is simple; it just resets the counts of lines, words, characters, and bytes to zero, and saves the current file name in fname:

function beginfile(file)
{
    lines = words = chars = bytes = 0
    fname = FILENAME
}

The endfile() function adds the current file’s numbers to the running totals of lines, words, characters, and bytes. It then prints out those numbers for the file that was just read. It relies on beginfile() to reset the numbers for the following data file:

function endfile(file)
{
    tlines += lines
    twords += words
    tchars += chars
    tbytes += bytes
    if (do_lines)
        printf "\t%d", lines
    if (do_words)
        printf "\t%d", words
    if (do_chars)
        printf "\t%d", chars
    if (do_bytes)
        printf "\t%d", bytes
    printf "\t%s\n", fname
}

There is one rule that is executed for each line. It adds the length of the record, plus one, to chars. Adding one to the record length is necessary because the newline character separating records (the value of RS) is not part of the record itself, and thus is not included in its length. Similarly, it adds the length of the record in bytes, plus one, to bytes. Next, lines is incremented for each line read, and words is incremented by the value of NF, which is the number of “words” on this line:

# do per line
{
    chars += length($0) + 1    # get newline
    bytes += mbs_length($0) + 1
    lines++
    words += NF
}
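To see why the extra one is needed, consider the three records awk would read from the ten-character input "hello\n\nhi\n". The following sketch adds them up by hand; the function name chars_with_newline() and the sample data are hypothetical. It uses only ASCII text, so the character and byte counts coincide and the mbs extension is not needed:

```awk
# chars_with_newline --- hypothetical helper, not part of wc.awk
#
# Return the number of characters a record occupied in the
# input, including the newline that awk stripped from it.

function chars_with_newline(rec)
{
    return length(rec) + 1
}

BEGIN {
    # The records for the input "hello\n\nhi\n", written out
    # by hand; the middle record is the empty line
    n = split("hello,,hi", recs, ",")

    total = 0
    for (i = 1; i <= n; i++)
        total += chars_with_newline(recs[i])

    # 6 + 1 + 3 = 10, the size of the original input
    print total
}
```

Without the added one, the sum would be 7, and the empty line would contribute nothing at all.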

Finally, the END rule prints the totals for all the files, but only if more than one file was named on the command line:

END {
    if (print_total) {
        if (do_lines)
            printf "\t%d", tlines
        if (do_words)
            printf "\t%d", twords
        if (do_chars)
            printf "\t%d", tchars
        if (do_bytes)
            printf "\t%d", tbytes
        print "\ttotal"
    }
}
