3.2 Escape Sequences
Some characters cannot be included literally in string constants
("foo") or regexp constants (/foo/).
Instead, they should be represented with escape sequences,
which are character sequences beginning with a backslash (‘\’).
One use of an escape sequence is to include a double-quote character in
a string constant. Because a plain double quote ends the string, you
must use ‘\"’ to represent an actual double-quote character as a
part of the string. For example:
$ awk 'BEGIN { print "He said \"hi!\" to her." }'
-| He said "hi!" to her.
The backslash character itself is another character that cannot be
included normally; you must write ‘\\’ to put one backslash in the
string or regexp. Thus, the string whose contents are the two characters
‘"’ and ‘\’ must be written "\"\\".
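As a quick check of the two-character string just described, the following sketch prints its length and its contents. The variable names `s` and `result` are illustrative only; the awk program is kept inside single quotes so that the shell passes the backslashes through unchanged.

```shell
# "\"\\" is an awk string constant holding a double quote and a backslash.
result=$(awk 'BEGIN { s = "\"\\"; print length(s), s }')

# printf is used instead of echo, because some echo implementations
# interpret backslashes themselves.
printf '%s\n' "$result"    # prints: 2 "\
```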
Other escape sequences represent unprintable characters
such as TAB or newline. While there is nothing to stop you from entering most
unprintable characters directly in a string constant or regexp constant,
they may look ugly.
The following table lists
all the escape sequences used in awk and
what they represent. Unless noted otherwise, all these escape
sequences apply to both string constants and regexp constants:
\\- A literal backslash, ‘\’.
\a- The “alert” character, Ctrl-g, ASCII code 7 (BEL).
(This usually makes some sort of audible noise.)
\b- Backspace, Ctrl-h, ASCII code 8 (BS).
\f- Formfeed, Ctrl-l, ASCII code 12 (FF).
\n- Newline, Ctrl-j, ASCII code 10 (LF).
\r- Carriage return, Ctrl-m, ASCII code 13 (CR).
\t- Horizontal TAB, Ctrl-i, ASCII code 9 (HT).
\v- Vertical tab, Ctrl-k, ASCII code 11 (VT).
\nnn- The octal value nnn, where nnn stands for 1 to 3 digits
between ‘0’ and ‘7’. For example, the code for the ASCII ESC
(escape) character is ‘\033’.
\xhh...- The hexadecimal value hh, where hh stands for a sequence
of hexadecimal digits (‘0’–‘9’, and either ‘A’–‘F’
or ‘a’–‘f’). Like the same construct
in ISO C, the escape sequence continues until the first nonhexadecimal
digit is seen. (c.e.)
However, using more than two hexadecimal digits produces
undefined results. (The ‘\x’ escape sequence is not allowed in
POSIX awk.)
\/- A literal slash (necessary for regexp constants only).
This sequence is used when you want to write a regexp
constant that contains a slash. Because the regexp is delimited by
slashes, you need to escape the slash that is part of the pattern,
in order to tell awk to keep processing the rest of the regexp.
\"- A literal double quote (necessary for string constants only).
This sequence is used when you want to write a string
constant that contains a double quote. Because the string is delimited by
double quotes, you need to escape the quote that is part of the string,
in order to tell awk to keep processing the rest of the string.
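A few of the sequences from the table can be tried out as follows. This is only a small sketch using sequences that POSIX awk accepts (it avoids ‘\x’ on purpose); the shell variable names are made up for the illustration.

```shell
# \101, \102, and \103 are the octal codes for A, B, and C.
octal=$(awk 'BEGIN { print "\101\102\103" }')

# "\t" is a single TAB character, so splitting on it yields three pieces.
pieces=$(awk 'BEGIN { n = split("x\ty\tz", parts, "\t"); print n }')

# \/ lets a slash appear inside a slash-delimited regexp constant.
slash=$(awk 'BEGIN { if ("usr/bin" ~ /usr\/bin/) print "match"; else print "no match" }')

printf '%s %s %s\n' "$octal" "$pieces" "$slash"    # prints: ABC 3 match
```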
In gawk, a number of additional two-character sequences that begin
with a backslash have special meaning in regexps.
See GNU Regexp Operators.
In a regexp, a backslash before any character that is not in the previous list
and not listed in GNU Regexp Operators
means that the next character should be taken literally, even if it would
normally be a regexp operator. For example, /a\+b/ matches the three
characters ‘a+b’.
For complete portability, do not use a backslash before any character not
shown in the previous list.
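The difference between the escaped and unescaped ‘+’ in the /a\+b/ example can be seen with a sketch like the following. The variable names are hypothetical, and the results shown are what you would expect from an awk that follows the rule described above.

```shell
result=$(awk 'BEGIN {
    escaped_hit  = ("a+b" ~ /a\+b/)   # backslash makes + literal: matches
    escaped_miss = ("aab" ~ /a\+b/)   # no literal "a+b" here: no match
    operator_hit = ("aab" ~ /a+b/)    # unescaped + means "one or more a"
    print escaped_hit, escaped_miss, operator_hit
}')
printf '%s\n' "$result"    # prints: 1 0 1
```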
To summarize:
- The escape sequences in the table above are always processed first,
for both string constants and regexp constants. This happens very early,
as soon as awk reads your program.
- gawk processes both regexp constants and dynamic regexps
(see Computed Regexps),
for the special operators listed in
GNU Regexp Operators.
- A backslash before any other character means to treat that character
literally.
Backslash Before Regular Characters
If you place a backslash in a string constant before something that is
not one of the characters previously listed, POSIX awk purposely
leaves what happens as undefined. There are two choices:
- Strip the backslash out
This is what Brian Kernighan's awk and gawk both do.
For example,
"a\qc" is the same as "aqc".
(Because this is such an easy bug both to introduce and to miss,
gawk warns you about it.)
Consider ‘FS = "[ \t]+\|[ \t]+"’, which is meant to use vertical bars
surrounded by whitespace as the field separator. There should be
two backslashes in the string: ‘FS = "[ \t]+\\|[ \t]+"’.
- Leave the backslash alone
Some other awk implementations do this.
In such implementations, typing
"a\qc" is the same as typing
"a\\qc".
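The two-backslash form of the field separator can be exercised with a sketch like this one. The sample input line and the shell variable name are invented for the illustration; the string escape ‘\\’ leaves one backslash in the regexp, which in turn makes the vertical bar literal.

```shell
# Split a line on vertical bars surrounded by spaces or TABs.
result=$(printf 'alpha | beta  |  gamma\n' |
    awk 'BEGIN { FS = "[ \t]+\\|[ \t]+" } { print NF ":" $2 }')

printf '%s\n' "$result"    # prints: 3:beta
```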
Escape Sequences for Metacharacters
Suppose you use an octal or hexadecimal
escape to represent a regexp metacharacter.
(See Regexp Operators.)
Does awk treat the character as a literal character or as a regexp
operator?
Historically, such characters were taken literally.
(d.c.)
However, the POSIX standard indicates that they should be treated
as real metacharacters, which is what gawk does.
In compatibility mode (see Options),
gawk treats the characters represented by octal and hexadecimal
escape sequences literally when used in regexp constants. Thus,
in compatibility mode, /a\52b/ is equivalent to /a\*b/.
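Because the answer depends on the awk implementation and the mode it runs in, one way to find out what your own awk does is a probe along these lines. This is only a sketch; the labels "literal", "operator", and "neither" are made up for the illustration, and no particular result is promised.

```shell
verdict=$(awk 'BEGIN {
    # \52 is the octal code for "*".
    if ("a*b" ~ /^a\52b$/)
        print "literal"      # the star was matched as an ordinary character
    else if ("aab" ~ /^a\52b$/)
        print "operator"     # the star acted as the repetition operator
    else
        print "neither"
}')
printf '%s\n' "$verdict"
```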