4.5.4 Working With Comma Separated Value Files

Many commonly-used tools use a comma to separate fields, instead of whitespace. This is particularly true of popular spreadsheet programs. There is no universally accepted standard for the format of these files, although RFC 4180 lists the common practices.

For decades, anyone wishing to work with CSV files and awk had to “roll their own” solution. (For an example, see Defining Fields by Content). In 2023, Brian Kernighan decided to add CSV support to his version of awk. In order to keep up, gawk too provides the same support as his version. To use CSV data, invoke gawk with either of the -k or --csv options.

Fields in CSV files are separated by commas. In order to allow a comma to appear inside a field (i.e., as data), the field may be quoted by beginning and ending it with double quotes. In order to allow a double quote inside a field, the field must be quoted, and two double quotes represent an actual double quote. The double quote that starts a quoted field must be the first character after the comma. Table 4.1 shows some examples.

InputField Contents
abc defabc def
"quoted data"quoted data
"quoted, data"quoted, data
"She said ""Stop!""."She said "Stop!".

Table 4.1: Examples of CSV data

Additionally, and here’s where it gets messy, newlines are also allowed inside double-quoted fields! In order to deal with such things, when processing CSV files, gawk scans the input data looking for newlines that are not enclosed in double quotes. Thus, use of the --csv option totally overrides normal record processing with RS (see How Input Is Split into Records), as well as field splitting with any of FS, FIELDWIDTHS, or FPAT.

Carriage-Return–Line-Feed Line Endings In CSV Files

\r\n is the invention of the devil.

Brian Kernighan

Many CSV files are imported from systems where the line terminator for text files is a carriage-return–line-feed pair (CR-LF, ‘\r’ followed by ‘\n’). For ease of use, when processing CSV files, gawk converts CR-LF pairs into a single newline. That is, the ‘\r’ is removed.

This occurs only when a CR is paired with an LF; a standalone CR is left alone. This behavior is consistent with Windows systems which automatically convert CR-LF in files into a plain LF in memory, and also with the commonly available unix2dos utility program.

The behavior of the split() function (not formally discussed yet, see String-Manipulation Functions) differs slightly when processing CSV files. When called with two arguments (‘split(string, array)’), split() does CSV-based splitting. Otherwise, it behaves normally.

If --csv has been used, PROCINFO["CSV"] will exist. Otherwise, it will not. See Built-in Variables That Convey Information.

Finally, if --csv has been used, assigning a value to any of FS, FIELDWIDTHS, FPAT, or RS generates a warning message.

To be clear, gawk takes RFC 4180 as its specification for CSV input data. There are no mechanisms for accepting nonstandard CSV data, such as files that use a semicolon instead of a comma as the separator.