Splitting By Content (The GNU Awk User’s Guide)

Next: Checking How gawk Is Splitting Records, Previous: Reading Fixed-Width Data, Up: Reading Input Files [Contents][Index]

4.7 Defining Fields by Content ¶

This section discusses an advanced feature of gawk. If you are a novice awk user, you might want to skip it on the first reading.

Normally, when using FS, gawk defines the fields as the parts of the record that occur in between each field separator. In other words, FS defines what a field is not, instead of what a field is. However, there are times when we really want to define the fields by what they are, and not by what they are not.

The FPAT variable offers a solution for cases like this. The value of FPAT should be a string that provides a regular expression. This regular expression describes the contents of each field.

We can explore the strengths, and some limitations, of FPAT using the case of comma-separated values (CSV) data. This case is somewhat obsolete as gawk now has built-in CSV parsing (see Working With Comma Separated Value Files). Nonetheless, it remains useful as an example of what FPAT-based field parsing can do. It is also useful for versions of gawk prior to version 5.3.

Many spreadsheet programs, for example, can export their data into text files, where each record is terminated with a newline, and fields are separated by commas. If commas only separated the data, there wouldn’t be an issue with using ‘FS = ","’ to split the data into fields. The problem comes when one of the fields contains an embedded comma. In such cases, most programs embed the field in double quotes.²⁷ So, we might have data like this:

Robbins,Arnold,,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA

In the case of CSV data as presented here, each field is either “anything that is not a comma,” or “a double quote, anything that is not a double quote, and a closing double quote.” We also need to bear in mind that some fields may be empty. If written as a regular expression constant (see Regular Expressions), we would have /([^,]*)|("[^"]+")/. Writing this as a string requires us to escape the double quotes, leading to:

FPAT = "([^,]*)|(\"[^\"]+\")"

Putting this to use, here is a simple program to parse the data:

BEGIN {
    FPAT = "([^,]*)|(\"[^\"]+\")"
}

{
    print "NF =", NF
    for (i = 1; i <= NF; i++) {
        printf("$%d = <%s>\n", i, $i)
    }
}

When run, we get the following:

$ gawk -f simple-csv.awk addresses.csv
-| NF = 8
-| $1 = <Robbins>
-| $2 = <Arnold>
-| $3 = <>
-| $4 = <"1234 A Pretty Street, NE">
-| $5 = <MyTown>
-| $6 = <MyState>
-| $7 = <12345-6789>
-| $8 = <USA>

Note the empty data field in the value of $3 and the embedded comma in the value of $4, in which the data remains wrapped in its enclosing double quotes.

The use of enclosing double quotes has a consequence for fields that contain double quotes as part of the data itself. For those fields, a double quote appearing inside a field must be escaped by preceding it with another double quote, as shown in the third and fourth lines of the following example:

1,2,3
p,"q,r",s
p,"q""r",s
p,"q,""r",s
p,"",s
p,,s

Manuel Collado suggests that the simplest FPAT expression that recognizes this kind of CSV field is /([^,]*)|("([^"]|"")+")/.

The following program uses this improved FPAT expression to split the example CSV fields above, extracts the underlying data from any double-quoted fields, and finally prints the data as tab-separated values. The latter is accomplished by setting OFS to a TAB character.

BEGIN {
    FPAT = "([^,]*)|(\"([^\"]|\"\")+\")"
    OFS = "\t"    # Print tab-separated values
}

{
    for (i = 1; i <= NF; i++) {
        # Extract data from double-quoted fields
        if (substr($i, 1, 1) == "\"") {
            gsub(/^"|"$/, "", $i)    # Remove enclosing quotes
            gsub(/""/, "\"", $i)    # Convert "" to "
        }
    }
    $1 = $1    # force rebuild of the record
    print
}

When run, it produces:

$ gawk -f quoted-csv.awk sample.csv
-| 1       2       3
-| p       q,r     s
-| p       q"r     s
-| p       q,"r    s
-| p               s
-| p               s

Some programs export CSV data that contain embedded newlines between the double quotes, and here we run into a limitation of FPAT: it provides no way to deal with this. Hence, using FPAT to do your own CSV parsing is an elegant approach for the majority of cases, but not all.

For a more general solution to working with CSV data, see Working With Comma Separated Value Files. If your version of gawk is prior to version 5.3, we recommend that you use Manuel Collado’s CSVMODE library for gawk.

As with FS, the IGNORECASE variable (see Built-in Variables That Control awk) affects field splitting with FPAT.

Assigning a value to FPAT overrides field splitting with FS and with FIELDWIDTHS.

Finally, the patsplit() function makes the same functionality available for splitting regular strings (see String-Manipulation Functions).

FS Versus FPAT: A Subtle Difference

Footnotes

(27)

The CSV format lacked a formal standard definition for many years. RFC 4180 standardizes the most common practices.