gawk on PC Operating Systems

Information in this section applies to the MinGW port of
gawk for MS-Windows. See Using gawk In The Cygwin Environment for information about
the Cygwin port.
The MS-Windows version of gawk searches for
program files as described in The AWKPATH Environment Variable. However,
semicolons (rather than colons) separate elements in the AWKPATH
variable. If AWKPATH is not set or is empty, then the default
search path is ‘.;d:/usr/lib/awk;c:/lib/awk;c:/gnu/lib/awk’.
Similarly, the AWKLIBPATH environment variable, which tells
gawk where to look for dynamic extensions (see Writing Extensions for gawk), also uses semicolons to separate directories. If not
set in the environment, the default value hard-coded into
gawk is ‘d:/usr/lib/gawk/ext-api-version’,
where api-version is the version of the gawk extension API for which
gawk was compiled (see API Version Constants and Variables).
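As a small illustration of the semicolon convention, the following sketch splits a search path into its elements. It assumes a POSIX-style shell (such as one from MSYS2) rather than the Windows command prompt, and the function name list_awkpath is hypothetical; it only inspects the string and does not change how gawk searches.

```shell
# Hypothetical helper: list the elements of a semicolon-separated
# search path, one per line, numbered in search order.
list_awkpath() {
    printf '%s\n' "$1" |
        awk -F';' '{ for (i = 1; i <= NF; i++) print i ": " $i }'
}

# The default AWKPATH value quoted above.
list_awkpath '.;d:/usr/lib/awk;c:/lib/awk;c:/gnu/lib/awk'
```

Splitting on colons instead would cut each element apart at its drive letter, which is presumably why the Windows port uses semicolons.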
MS-Windows traditionally supported internationalization and localization via codepages. Each Windows locale specifies a system-wide codepage, which defines both the non-ASCII characters supported by Windows and their encoding into small integer values. Most Windows locales use a single-byte encoding; the notable exceptions are the CJK (Chinese, Japanese, and Korean) codepages, which use one or two bytes for each non-ASCII character. Recent versions of MS-Windows also support Unicode and the UTF-8 encoding, but this support is being introduced slowly: as of this writing, setting UTF-8 as the system-wide codepage on Windows is still considered an experimental feature, which is turned off by default. Windows designates the UTF-8 encoding as codepage 65001.
The MinGW build of gawk for MS-Windows supports multibyte
characters according to the codepage defined by the current system
locale, and it also includes (starting with version 5.4) reasonable
support for text encoded in UTF-8. When the system’s codepage is
anything other than 65001, gawk uses the functions from the
MS-Windows C runtime library to convert between multibyte and
wide-character representation of text, and for character
classification and collation; this is in most cases limited to
characters inside the Unicode Basic Multilingual Plane
(BMP), due to the limitations of the Windows runtime. By
contrast, if the system’s codepage is set to 65001, gawk
attempts to support the full range of Unicode codepoints, including
the BMP and all 16 Supplementary Planes defined by
Unicode, by using its own conversion and classification routines
instead of those in the Windows C runtime. (This might still fail to
work as well as it does on modern POSIX systems, because these
routines rely on the character database data included with your
Windows installation.)
When codepage 65001 is used, gawk uses UTF-8 to encode
multibyte text, and converts between UTF-8 and 32-bit Unicode
codepoints internally. Thus, to make sure that awk programs that were
written for POSIX systems and that handle multibyte non-ASCII text work
correctly on MS-Windows, you should use codepage 65001.
As a special feature, gawk version 5.4 and later will use
UTF-8 internally even when the system-wide codepage is something other
than 65001, if gawk detects that the Windows terminal window
in which it runs uses codepage 65001 for console output. (Unlike on
POSIX systems, Windows users can change the console encoding
independently of the system-wide locale’s encoding.) To switch the
Windows terminal to using UTF-8 encoding, type ‘chcp 65001’ at the
Windows shell’s prompt. This causes gawk on Windows to
support the full range of Unicode characters even without changing the
system-wide locale’s codepage. We recommend that you use codepage
65001 on Windows 11 and later systems, where the Windows terminal
supports UTF-8 quite well.
Note that gawk on Windows will still treat all input data as
single-byte characters when invoked with the -b command-line
option (see Command-Line Options), or when running under
the "C" locale (i.e., when the LC_ALL environment
variable is set to the value ‘C’ by typing ‘set LC_ALL=C’ at
the shell prompt before invoking gawk). This overrides the
codepage-defined encoding of text.
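The effect of the "C" locale can be seen by asking awk for the length of a string that contains a multibyte character. This is a minimal sketch that assumes a POSIX-style shell, where the variable is set with a ‘LC_ALL=C’ prefix rather than ‘set LC_ALL=C’; the function name c_locale_length is hypothetical.

```shell
# Hypothetical helper: report the length of a string as awk sees it
# when running in the "C" locale, where each byte is one character.
c_locale_length() {
    LC_ALL=C awk -v s="$1" 'BEGIN { print length(s) }'
}

# U+00E9 (e with acute accent) is the two bytes 0xC3 0xA9 in UTF-8.
c_locale_length "$(printf '\303\251')"
```

In the "C" locale the two UTF-8 bytes should be counted separately, so the reported length is 2 rather than 1.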
Under MS-Windows, gawk (and many other text programs) silently
translates end-of-line ‘\r\n’ to ‘\n’ on input and ‘\n’
to ‘\r\n’ on output. A special BINMODE variable (a common extension)
allows control over these translations and is interpreted as follows:
If BINMODE is "r" or one, then binary mode is set on read (i.e., no translations on reads).
If BINMODE is "w" or two, then binary mode is set on write (i.e., no translations on writes).
If BINMODE is "rw" or "wr" or three, then binary mode is set for both read and write.
BINMODE=non-null-string is the same as ‘BINMODE=3’ (i.e., no translations on reads or writes). However, gawk issues a warning message if the string is not one of "rw" or "wr".
The modes for standard input and standard output are set one time
only (after the
command line is read, but before processing any of the awk program).
Setting BINMODE for standard input or
standard output is accomplished by using an
appropriate ‘-v BINMODE=N’ option on the command line.
BINMODE is set at the time a file or pipe is opened and cannot be
changed midstream.
On POSIX-compatible systems, this variable’s value has no effect.
Thus, if you think your program will run on multiple different systems
and that you may need to use BINMODE, you should simply set it
(in the program or on the command line) unconditionally, and not worry
about the operating system on which your program is running.
The name BINMODE was chosen to match mawk
(see Other Freely Available awk Implementations).
mawk and gawk handle BINMODE similarly; however,
mawk adds a ‘-W BINMODE=N’ option and an environment
variable that can set BINMODE, RS, and ORS. The
files binmode[1-3].awk (under gnu/lib/awk in some of the
prepared binary distributions) have been chosen to match mawk’s ‘-W
BINMODE=N’ option. These can be changed or discarded; in particular,
the setting of RS giving the fewest “surprises” is open to debate.
mawk uses ‘RS = "\r\n"’ if binary mode is set on read, which is
appropriate for files with the MS-DOS-style end-of-line.
To illustrate, the following examples set binary mode on writes for standard
output and other files, and set ORS as the “usual” MS-DOS-style
end-of-line:
gawk -v BINMODE=2 -v ORS="\r\n" ...
or:
gawk -v BINMODE=w -f binmode2.awk ...
These give the same result as the ‘-W BINMODE=2’ option in
mawk.
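To check what such a command writes, the output can be inspected byte by byte. The sketch below assumes a POSIX-style shell and the od utility; the function name to_crlf is hypothetical. On POSIX-compatible systems BINMODE has no effect, so there the carriage returns come from ORS alone.

```shell
# Hypothetical helper: copy standard input to standard output,
# ending every record with carriage return plus newline.
to_crlf() {
    awk -v BINMODE=2 -v ORS='\r\n' '{ print }'
}

# Dump the result so the \r \n pairs are visible.
printf 'a\nb\n' | to_crlf | od -c
```

Each input line should come out with a ‘\r’ before its ‘\n’, so the four input bytes become six output bytes.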
The following changes the record separator to "\r\n" and sets binary
mode on reads, but does not affect the mode on standard input:
gawk -v RS="\r\n" -e "BEGIN { BINMODE = 1 }" ...
or:
gawk -f binmode1.awk ...
With proper quoting, in the first example the setting of RS can be
moved into the BEGIN rule.
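One way to write that variant is sketched below, assuming a POSIX-style shell, where single quotes protect the whole program text; quoting at the Windows command prompt differs. The function name crlf_records is hypothetical. The data is read from a named file rather than from standard input, because setting BINMODE in a BEGIN rule does not affect the mode of standard input.

```shell
# Hypothetical helper: print each CRLF-terminated record of the named
# file, prefixed with its record number. Both BINMODE and RS are set
# inside the BEGIN rule.
crlf_records() {
    awk 'BEGIN { BINMODE = 1; RS = "\r\n" } { print NR ": " $0 }' "$1"
}

# Demonstration on a scratch file created with mktemp.
tmp=$(mktemp)
printf 'one\r\ntwo\r\n' > "$tmp"
crlf_records "$tmp"
rm -f "$tmp"
```

This relies on awk accepting a multi-character RS, which gawk and mawk do.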
Under MS-Windows, the MinGW port of gawk supports
both the ‘|&’ operator and TCP/IP networking
(see Using gawk for Network Programming).