3.4 Using Bracket Expressions

As mentioned earlier, a bracket expression matches any character amongst those listed between the opening and closing square brackets.

Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, based upon the system's native character set. For example, ‘[0-9]’ is equivalent to ‘[0123456789]’. (See Ranges and Locales for an explanation of how the POSIX standard and gawk have changed over time. This is mainly of historical interest.)
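As a minimal sketch of a range expression in use, the following shell commands filter a few invented input lines with awk. The sample text and the variable names are illustrative only, and LC_ALL=C is an assumption made here to pin the character ordering to plain ASCII.

```shell
# Keep only the lines that contain at least one digit, using a range.
# LC_ALL=C is set so that the range follows plain ASCII ordering.
digits=$(printf '%s\n' 'room 101' 'no digits here' 'x9' |
    LC_ALL=C awk '/[0-9]/')
printf '%s\n' "$digits"

# The explicit list of digits should select exactly the same lines.
listed=$(printf '%s\n' 'room 101' 'no digits here' 'x9' |
    LC_ALL=C awk '/[0123456789]/')
printf '%s\n' "$listed"
```

Both commands should print the lines ‘room 101’ and ‘x9’.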

To include one of the characters ‘\’, ‘]’, ‘-’, or ‘^’ in a bracket expression, put a ‘\’ in front of it. For example:

     [d\]]

matches either ‘d’ or ‘]’.
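The escaping rule can be tried from the shell with a sketch like the one below. The four one-character input lines are invented for illustration, and the regexp is placed in single quotes so that the shell passes the backslash to awk unchanged.

```shell
# Feed four one-character lines to awk and keep those matching [d\]].
# The backslash makes the first ']' an ordinary member of the list
# instead of the end of the bracket expression.
matched=$(printf '%s\n' 'd' ']' 'x' '[' | awk '/[d\]]/')
printf '%s\n' "$matched"
```

Only the lines ‘d’ and ‘]’ should be printed; ‘x’ and ‘[’ are not in the list.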

This treatment of ‘\’ in bracket expressions is compatible with other awk implementations and is also mandated by POSIX. The regular expressions in awk are a superset of the POSIX specification for Extended Regular Expressions (EREs). POSIX EREs are based on the regular expressions accepted by the traditional egrep utility.

Character classes are a feature introduced in the POSIX standard. A character class is a special notation for describing lists of characters that have a specific attribute, but the actual characters can vary from country to country and/or from character set to character set. For example, the notion of what is an alphabetic character differs between the United States and France.

A character class is only valid in a regexp inside the brackets of a bracket expression. Character classes consist of ‘[:’, a keyword denoting the class, and ‘:]’. Table 3.1 lists the character classes defined by the POSIX standard.

Class       Meaning
[:alnum:]   Alphanumeric characters.
[:alpha:]   Alphabetic characters.
[:blank:]   Space and TAB characters.
[:cntrl:]   Control characters.
[:digit:]   Numeric characters.
[:graph:]   Characters that are both printable and visible. (A space is printable but not visible, whereas an ‘a’ is both.)
[:lower:]   Lowercase alphabetic characters.
[:print:]   Printable characters (characters that are not control characters).
[:punct:]   Punctuation characters (characters that are not letters, digits, control characters, or space characters).
[:space:]   Space characters (such as space, TAB, and formfeed, to name a few).
[:upper:]   Uppercase alphabetic characters.
[:xdigit:]  Characters that are hexadecimal digits.

Table 3.1: POSIX Character Classes
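To see several of these classes at work, the sketch below counts how many characters of one invented input line fall into each class. It relies on gsub() returning the number of substitutions it made; replacing each match with ‘&’ leaves the text unchanged. Note the doubled brackets: the class name sits inside a bracket expression. The input string and the variable names are assumptions chosen for illustration, and an awk that supports POSIX character classes is assumed.

```shell
# Count the characters of each kind in the line "Ab1 ,x".
# gsub(re, "&") replaces every match with itself and returns the count.
summary=$(printf '%s\n' 'Ab1 ,x' | awk '{
    upper = gsub(/[[:upper:]]/, "&")
    lower = gsub(/[[:lower:]]/, "&")
    digit = gsub(/[[:digit:]]/, "&")
    punct = gsub(/[[:punct:]]/, "&")
    blank = gsub(/[[:blank:]]/, "&")
    printf "upper=%d lower=%d digit=%d punct=%d blank=%d\n",
           upper, lower, digit, punct, blank
}')
printf '%s\n' "$summary"
```

For this input the command should report one uppercase letter, two lowercase letters, one digit, one punctuation character, and one blank.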

For example, before the POSIX standard, you had to write /[A-Za-z0-9]/ to match alphanumeric characters. If your character set had other alphabetic characters in it, this would not match them. With the POSIX character classes, you can write /[[:alnum:]]/ to match the alphabetic and numeric characters in your character set.
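The two spellings can be compared directly with a sketch such as the following. The input lines are invented and contain only ASCII characters, so both patterns should select the same lines here; the results may differ once the input contains alphabetic characters outside ‘A-Za-z’ and the locale treats them as letters.

```shell
# Four sample lines: two contain alphanumeric characters, two do not.
input=$(printf '%s\n' 'abc123' '---' 'Z' '!?')

# Pre-POSIX style: spell out the ranges by hand.
old_style=$(printf '%s\n' "$input" | awk '/[A-Za-z0-9]/')

# POSIX style: let the character class describe the set.
posix_style=$(printf '%s\n' "$input" | awk '/[[:alnum:]]/')

printf '%s\n' "$old_style"
printf '%s\n' "$posix_style"
```

Each command should print the lines ‘abc123’ and ‘Z’.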

Two additional special sequences can appear in bracket expressions. These apply to non-ASCII character sets, which can have single symbols (called collating elements) that are represented with more than one character. They can also have several characters that are equivalent for collating, or sorting, purposes. (For example, in French, a plain “e” and a grave-accented “è” are equivalent.) These sequences are:

Collating symbols
Multicharacter collating elements enclosed between ‘[.’ and ‘.]’. For example, if ‘ch’ is a collating element, then [[.ch.]] is a regexp that matches this collating element, whereas [ch] is a regexp that matches either ‘c’ or ‘h’.


Equivalence classes
Locale-specific names for a list of characters that are equivalent. The name is enclosed between ‘[=’ and ‘=]’. For example, the name ‘e’ might be used to represent all of “e,” “è,” and “é.” In this case, [[=e=]] is a regexp that matches any of ‘e’, ‘é’, or ‘è’.

These features are very valuable in non-English-speaking locales.

CAUTION: The library functions that gawk uses for regular expression matching currently recognize only POSIX character classes; they do not recognize collating symbols or equivalence classes.