Many commonly-used tools use a comma to separate fields, instead of whitespace. This is particularly true of popular spreadsheet programs. There is no universally accepted standard for the format of these files, although RFC 4180 lists the common practices.
For decades, anyone wishing to work with CSV files and awk
had to “roll their own” solution.
(For an example, see Defining Fields by Content).
In 2023, Brian Kernighan decided to add CSV support to his version of
awk
. In order to keep up, gawk
too provides the same
support as his version.
To use CSV data, invoke gawk
with either of the
-k or --csv options.
Fields in CSV files are separated by commas. In order to allow a comma to appear inside a field (i.e., as data), the field may be quoted by beginning and ending it with double quotes. In order to allow a double quote inside a field, the field must be quoted, and two double quotes represent an actual double quote. The double quote that starts a quoted field must be the first character after the comma. Table 4.1 shows some examples.
Input | Field Contents |
---|---|
abc def | abc def |
"quoted data" | quoted data |
"quoted, data" | quoted, data |
"She said ""Stop!""." | She said "Stop!". |
Additionally, and here’s where it gets messy, newlines are also
allowed inside double-quoted fields!
In order to deal with such things, when processing CSV files,
gawk
scans the input data looking for newlines that
are not enclosed in double quotes. Thus, use of the --csv option
totally overrides normal record processing with RS
(see How Input Is Split into Records),
as well as field splitting with any of FS
, FIELDWIDTHS
,
or FPAT
.
Carriage-Return–Line-Feed Line Endings In CSV Files |
---|
— Brian Kernighan
Many CSV files are imported from systems where the line terminator
for text files is a carriage-return–line-feed pair
(CR-LF, ‘\r’ followed by ‘\n’).
For ease of use, when processing CSV files, This occurs only when a CR is paired with an LF; a standalone CR
is left alone. This behavior is consistent with Windows systems
which automatically convert CR-LF in files into a plain LF in memory,
and also with the commonly available |
The behavior of the split()
function (not formally discussed
yet, see String-Manipulation Functions) differs slightly when processing CSV
files. When called with two arguments
(‘split(string, array)’), split()
does CSV-based splitting. Otherwise, it behaves normally.
If --csv has been used, PROCINFO["CSV"]
will
exist. Otherwise, it will not. See Built-in Variables That Convey Information.
Finally, if --csv has been used, assigning a value
to any of FS
, FIELDWIDTHS
, FPAT
, or
RS
generates a warning message.
To be clear, gawk
takes
RFC 4180 as its
specification for CSV input data. There are no mechanisms
for accepting nonstandard CSV data, such as files that use
a semicolon instead of a comma as the separator.