B.3.1.3 Using gawk on PC Operating Systems

Information in this section applies to the MinGW port of gawk for MS-Windows. See Using gawk In The Cygwin Environment for information about the Cygwin port.

The MS-Windows version of gawk searches for program files as described in The AWKPATH Environment Variable. However, semicolons (rather than colons) separate elements in the AWKPATH variable. If AWKPATH is not set or is empty, then the default search path is ‘.;d:/usr/lib/awk;c:/lib/awk;c:/gnu/lib/awk’.

Similarly, the AWKLIBPATH environment variable, which tells gawk where to look for dynamic extensions (see Writing Extensions for gawk), also uses semicolons to separate directories. If it is not set in the environment, the default value hard-coded into gawk is ‘d:/usr/lib/gawk/ext-api-version’, where api-version is the version of the gawk extension API for which gawk was compiled (see API Version Constants and Variables).

MS-Windows traditionally supported internationalization and localization via codepages. Each Windows locale specifies a system-wide codepage, which defines both the non-ASCII characters supported by Windows and their encoding into small integer values. Most Windows locales use a single-byte encoding; the notable exceptions are the CJK (Chinese, Japanese, and Korean) codepages, which use one or two bytes for each non-ASCII character. Recent versions of MS-Windows also support Unicode and the UTF-8 encoding, but this support has been introduced slowly: as of this writing, setting UTF-8 as the system-wide codepage on Windows is still considered an experimental feature, which is turned off by default. Windows designates the UTF-8 encoding as codepage 65001.

The MinGW build of gawk for MS-Windows supports multibyte characters according to the codepage defined by the current system locale, and it also includes (starting with version 5.4) reasonable support for text encoded in UTF-8. When the system’s codepage is anything other than 65001, gawk uses the functions from the MS-Windows C runtime library to convert between multibyte and wide-character representations of text, and for character classification and collation; in most cases this is limited to characters inside the Unicode Basic Multilingual Plane (BMP), due to the limitations of the Windows runtime. By contrast, if the system’s codepage is set to 65001, gawk attempts to support the full range of Unicode codepoints, including the BMP and all 16 supplementary planes defined by Unicode, by using its own conversion and classification routines instead of those in the Windows C runtime. (This might still fail to work as well as it does on modern POSIX systems, because these routines rely on the character database data included with your Windows installation.)

When codepage 65001 is used, gawk uses UTF-8 to encode multibyte text, and converts between UTF-8 and 32-bit Unicode codepoints internally. Thus, to make sure that awk programs written for POSIX systems that handle multibyte non-ASCII text work correctly on MS-Windows, you should use codepage 65001.

As a special feature, gawk version 5.4 and later will use UTF-8 internally even when the system-wide codepage is something other than 65001, if gawk detects that the Windows terminal window in which it runs uses codepage 65001 for console output. (Unlike on POSIX systems, Windows users can change the console encoding independently of the system-wide locale’s encoding.) To switch the Windows terminal to using UTF-8 encoding, type ‘chcp 65001’ at the Windows shell’s prompt. This causes gawk on Windows to support the full range of Unicode characters even without changing the system-wide locale’s codepage. We recommend that you use codepage 65001 on Windows 11 and later systems, where the Windows terminal supports UTF-8 quite well.

Note that gawk on Windows will still treat all input data as single-byte characters when invoked with the -b command-line option (see Command-Line Options), or when running under the "C" locale (i.e., when the LC_ALL environment variable is set to the value ‘C’ by typing ‘set LC_ALL=C’ at the shell prompt before invoking gawk). This overrides the codepage-defined encoding of text.
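As a small sketch of the effect of -b, the following feeds gawk the two-byte UTF-8 encoding of ‘é’ and prints its length. It assumes a POSIX-style shell that provides printf (for example, an MSYS shell on Windows); that environment is an assumption made for illustration and is not something this section requires:

```shell
# "é" in UTF-8 is the two bytes 0xC3 0xA9 (octal 303 251).
# With -b, gawk treats each byte as one character, so length() counts bytes.
printf '\303\251\n' | gawk -b '{ print length($0) }'
# prints 2
```

Without -b, and with codepage 65001 in effect, the same input would be expected to count as a single character.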

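Under MS-Windows, gawk (and many other text programs) silently translates end-of-line ‘\r\n’ to ‘\n’ on input and ‘\n’ to ‘\r\n’ on output. A special BINMODE variable (c.e.) allows control over these translations and is interpreted as follows:

If BINMODE is "r" or one, then binary mode is set on read (i.e., no translations on reads).

If BINMODE is "w" or two, then binary mode is set on write (i.e., no translations on writes).

If BINMODE is "rw" or "wr" or three, binary mode is set for both read and write.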

The modes for standard input and standard output are set one time only (after the command line is read, but before processing any of the awk program). Setting BINMODE for standard input or standard output is accomplished by using an appropriate ‘-v BINMODE=N’ option on the command line. BINMODE is set at the time a file or pipe is opened and cannot be changed midstream.

On POSIX-compatible systems, this variable’s value has no effect. Thus, if you think your program will run on multiple different systems and that you may need to use BINMODE, you should simply set it (in the program or on the command line) unconditionally, and not worry about the operating system on which your program is running.
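A minimal sketch of this unconditional style follows. It assumes a POSIX-style shell with printf; the file name crlf-sample.txt is hypothetical and chosen only for illustration. The command sets BINMODE to 3 (binary mode for both reads and writes) and then states the input and output line endings explicitly, so the output should be the same bytes whether or not the operating system performs end-of-line translation:

```shell
# Create a small file with MS-DOS-style (CR LF) line endings.
# The file name is a hypothetical example.
printf 'alpha\r\nbeta\r\n' > crlf-sample.txt

# BINMODE=3 turns off translations on reads and writes under MS-Windows
# and has no effect on POSIX systems. RS and ORS then name the line
# endings explicitly: CR LF on input, a bare LF on output.
gawk -v BINMODE=3 -v RS='\r\n' -v ORS='\n' '{ print NR ": " $0 }' crlf-sample.txt
# prints:
# 1: alpha
# 2: beta
```

Because the settings are given on the command line, they also apply to standard input and standard output.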

The name BINMODE was chosen to match mawk (see Other Freely Available awk Implementations). mawk and gawk handle BINMODE similarly; however, mawk adds a ‘-W BINMODE=N’ option and an environment variable that can set BINMODE, RS, and ORS. The files binmode[1-3].awk (under gnu/lib/awk in some of the prepared binary distributions) have been chosen to match mawk’s ‘-W BINMODE=N’ option. These can be changed or discarded; in particular, the setting of RS giving the fewest “surprises” is open to debate. mawk uses ‘RS = "\r\n"’ if binary mode is set on read, which is appropriate for files with the MS-DOS-style end-of-line.

To illustrate, the following examples set binary mode on writes for standard output and other files, and set ORS as the “usual” MS-DOS-style end-of-line:

gawk -v BINMODE=2 -v ORS="\r\n" ...

or:

gawk -v BINMODE=w -f binmode2.awk ...

These give the same result as the ‘-W BINMODE=2’ option in mawk. The following changes the record separator to "\r\n" and sets binary mode on reads, but does not affect the mode on standard input:

gawk -v RS="\r\n" -e "BEGIN { BINMODE = 1 }" ...

or:

gawk -f binmode1.awk ...

With proper quoting, in the first example the setting of RS can be moved into the BEGIN rule.
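One way this might look is sketched here, with the inner double quotes escaped so that the string constant reaches gawk intact. The escaping shown is written for a POSIX-style shell, and quoting rules in other shells may differ; the file name crlf-sample.txt and the END rule that prints the record count are hypothetical additions made only so that the example is complete:

```shell
# Create a two-record file with MS-DOS-style (CR LF) line endings.
# The file name is a hypothetical example.
printf 'alpha\r\nbeta\r\n' > crlf-sample.txt

# Both BINMODE and RS are assigned inside the BEGIN rule. The \" sequences
# pass literal double quotes through the shell, so gawk sees RS = "\r\n".
gawk -e "BEGIN { BINMODE = 1; RS = \"\r\n\" }" -e "END { print NR }" crlf-sample.txt
# prints 2
```

As with the original form, the assignment to BINMODE in the BEGIN rule affects files opened afterwards but not standard input.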

Under MS-Windows, the MinGW port of gawk supports both the ‘|&’ operator and TCP/IP networking (see Using gawk for Network Programming).