Gcal 4.1: Regexp Operators

E.2 Regular Expression Operators

You can combine regular expressions with the following characters, called regular expression operators, or metacharacters, to increase the power and versatility of regular expressions.

Here is a table of these metacharacters. All characters that are not listed in the table stand for themselves.

\

This is used to suppress the special meaning of a character when matching. For example:

\$

matches the character ‘$’.

^

This matches the beginning of a string. For example:

^@chapter

matches the ‘@chapter’ at the beginning of a string, and can be used to identify chapter beginnings in Texinfo source files. The ‘^’ is known as an anchor, since it anchors the pattern to matching only at the beginning of the string.

$

This is similar to ‘^’, but it matches only at the end of a string. For example:

p$

matches a string that ends with a ‘p’. The ‘$’ is also an anchor.
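As a rough illustration of both anchors, the following sketch uses GNU grep -E as a stand-in for Gcal's matcher. This is an assumption: grep tests each input line as a separate string, and the sample lines are invented.

```shell
# Invented sample lines; grep treats each line as one string.
sample='@chapter Intro
See @chapter two
stop
stopped'

# '^@chapter' matches only when the string begins with @chapter.
starts=$(printf '%s\n' "$sample" | grep -E '^@chapter')

# 'p$' matches only when the string ends with p.
ends=$(printf '%s\n' "$sample" | grep -E 'p$')

printf '%s\n' "$starts" "$ends"
```

Only ‘@chapter Intro’ satisfies the first pattern and only ‘stop’ satisfies the second.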

.

The period, or dot, matches any single character. For example:

.P

matches any single character followed by a ‘P’ in a string. Using concatenation we can make a regular expression like ‘U.A’, which matches any three-character sequence that begins with ‘U’ and ends with ‘A’.
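The difference between a bare dot and an escaped dot can be sketched with GNU grep -E, used here only as a convenient substitute for Gcal's regexp library (an assumption); the sample lines are made up.

```shell
# Invented sample lines, one string per line.
sample='xP
.P
P'

# '.P' needs some character before the P, so a lone 'P' does not match.
any=$(printf '%s\n' "$sample" | grep -E '.P')

# '\.P' suppresses the special meaning of the dot: only a literal '.' will do.
literal=$(printf '%s\n' "$sample" | grep -E '\.P')

printf 'dot:     %s\n' $any
printf 'escaped: %s\n' $literal
```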

[…]

This is called a character list. It matches any one of the characters that are enclosed in the square brackets. For example:

[MVX]

matches any one of the characters ‘M’, ‘V’, or ‘X’ in a string.

Ranges of characters are indicated by using a hyphen between the beginning and ending characters, and enclosing the whole thing in brackets. For example:

[0-9]

matches any digit. Multiple ranges are allowed. E.g., the list [A-Za-z0-9] is a common way to express the idea of “all alphanumeric characters.”

To include one of the characters ‘\’, ‘]’, ‘-’ or ‘^’ in a character list, put a ‘\’ in front of it. For example:

[d\]]

matches either ‘d’, or ‘]’.

Character classes are a new feature introduced in the POSIX standard. A character class is a special notation for describing lists of characters that have a specific attribute, but where the actual characters themselves can vary from country to country and/or from character set to character set. For example, the notion of what is an alphabetic character differs in the U.S.A. and in France.

A character class is only valid in a regexp inside the brackets of a character list. Character classes consist of ‘[:’, a keyword denoting the class, and ‘:]’. Here are the character classes defined by the POSIX standard:

[:alnum:]: Alphanumeric characters.
[:alpha:]: Alphabetic characters.
[:blank:]: Space and tab characters.
[:cntrl:]: Control characters.
[:digit:]: Numeric characters.
[:graph:]: Characters that are printable and are also visible⁷⁶.
[:lower:]: Lower-case alphabetic characters.
[:print:]: Printable characters⁷⁷.
[:punct:]: Punctuation characters⁷⁸.
[:space:]: Space characters⁷⁹.
[:upper:]: Upper-case alphabetic characters.
[:xdigit:]: Characters that are hexadecimal digits.

For example, before the POSIX standard, to match alphanumeric characters, you had to write [A-Za-z0-9]. If your character set had other alphabetic characters in it, this would not match them. With the POSIX character classes, you can write [[:alnum:]], and this will match all the alphabetic and numeric characters in your character set.
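A minimal sketch of character classes inside a character list, run through GNU grep -E rather than Gcal itself (an assumption; the sample lines are invented). Note the doubled brackets: the outer pair is the character list, the inner pair belongs to the class name.

```shell
# Invented sample lines.
sample='abc123
abc 123
---'

# Whole string consists of alphanumeric characters only.
alnum=$(printf '%s\n' "$sample" | grep -E '^[[:alnum:]]+$')

# Whole string consists of punctuation characters only.
punct=$(printf '%s\n' "$sample" | grep -E '^[[:punct:]]+$')

printf '%s\n' "$alnum" "$punct"
```

The line ‘abc 123’ matches neither pattern because the space belongs to neither class.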

Two additional special sequences can appear in character lists. These apply to non-ASCII character sets, which can have single symbols (called collating elements) that are represented with more than one character, as well as several characters that are equivalent for collating, or sorting, purposes. (E.g., in French, a plain ‘e’ and a grave-accented ‘è’ are equivalent.)

Collating Symbols: A collating symbol is a multi-character collating element enclosed in ‘[.’ and ‘.]’. For example, if ‘ch’ is a collating element, then [[.ch.]] is a regexp that matches this collating element, while [ch] is a regexp that matches either ‘c’ or ‘h’.
Equivalence Classes: An equivalence class is a list of equivalent characters enclosed in ‘[=’ and ‘=]’. Thus, [[=eè=]] is a regexp that matches either ‘e’ or ‘è’.

These features are very valuable in non-English speaking locales.

Caution:
The library functions that Gcal uses for regular expression matching may recognize POSIX character classes, but they currently do not recognize collating symbols or equivalence classes.

[^ …]

This is a negated (or complemented) character list. The first character after the ‘[’ must be a ‘^’. It matches any character except those in the square brackets. For example:

[^0-9]

matches any character that is not a digit.
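One point that is easy to miss is that ‘[^0-9]’ alone only requires one non-digit somewhere in the string. The sketch below, which assumes GNU grep -E as a stand-in for Gcal's matcher and uses invented sample lines, contrasts that with an anchored form that forbids digits entirely.

```shell
# Invented sample lines.
sample='abc
a1c
42'

# At least one character that is not a digit appears somewhere.
some_nondigit=$(printf '%s\n' "$sample" | grep -E '[^0-9]')

# Every character, from start to end, is not a digit.
no_digits=$(printf '%s\n' "$sample" | grep -E '^[^0-9]+$')

printf 'some non-digit: %s\n' $some_nondigit
printf 'no digits:      %s\n' $no_digits
```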

|

This is the alternation operator, and it is used to specify alternatives. For example:

^P|[0-9]

matches any string that matches either ‘^P’ or ‘[0-9]’. This means it matches any string that starts with ‘P’ or contains a digit.

The alternation applies to the largest possible regexps on either side. In other words, ‘|’ has the lowest precedence of all the regular expression operators.
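The low precedence of ‘|’ can be seen in a small sketch using GNU grep -E in place of Gcal (an assumption; the sample lines are invented). The second pattern uses parentheses to pull the anchor outside the alternation, which changes the result.

```shell
# Invented sample lines.
sample='Paris
room 7
aPe'

# Read as (^P) | ([0-9]): starts with P, or contains a digit anywhere.
loose=$(printf '%s\n' "$sample" | grep -E '^P|[0-9]')

# Read as ^(P | [0-9]): starts with P or starts with a digit.
tight=$(printf '%s\n' "$sample" | grep -E '^(P|[0-9])')

printf '%s\n' "loose:" "$loose" "tight:" "$tight"
```

‘room 7’ matches only the first pattern, because its digit is not at the beginning of the string.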

(…)

Parentheses are used for grouping in regular expressions as in arithmetic. They can be used to concatenate regular expressions containing the alternation operator, ‘|’. For example, ‘@(samp|code)\{[^}]+\}’ matches both ‘@code{foo}’ and ‘@samp{bar}’. (These are Texinfo formatting control sequences.)
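The Texinfo pattern above can be tried out with GNU grep -E, which is assumed here to treat this pattern the same way Gcal's regexp library does. The sample lines are invented for the purpose.

```shell
# Invented sample lines.
sample='@code{foo}
@samp{bar}
@var{baz}
@code{}'

# The group limits the alternation to 'samp' or 'code'; '[^}]+' requires
# at least one character between the escaped braces.
hits=$(printf '%s\n' "$sample" | grep -E '@(samp|code)\{[^}]+\}')

printf '%s\n' "$hits"
```

‘@var{baz}’ fails because ‘var’ is not one of the alternatives, and ‘@code{}’ fails because the braces are empty.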

*

This symbol means that the preceding regular expression is to be repeated as many times as necessary to find a match. For example:

ph*

applies the ‘*’ symbol to the preceding ‘h’ and looks for matches of one ‘p’ followed by any number of ‘h’s. This will also match just ‘p’ if no ‘h’s are present.

The ‘*’ repeats the smallest possible preceding expression. (Use parentheses if you wish to repeat a larger expression.) It finds as many repetitions as possible. For example:

gcal --filter-text='\(c[ad][ad]*r x\)' -f sample.rc -y

prints every fixed date in sample.rc whose text contains a string of the form ‘(car x)’, ‘(cdr x)’, ‘(cadr x)’, and so on. Notice that the parentheses are escaped with preceding backslashes so that they are matched literally.

+

This symbol is similar to ‘*’, but the preceding expression must be matched at least once. This means that:

wh+y

would match ‘why’ and ‘whhy’ but not ‘wy’, whereas ‘wh*y’ would match all three of these strings. This is a simpler way of writing the last ‘*’ example:

gcal --filter-text='\(c[ad]+r x\)' -f sample.rc -y
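To check that the ‘*’ and ‘+’ forms of this pattern select the same strings, the patterns can be fed to GNU grep -E instead of Gcal. This is a sketch under the assumption that both tools interpret these patterns alike; the sample lines stand in for fixed date texts and are invented.

```shell
# Invented sample lines standing in for fixed date texts.
sample='(car x)
(cdr x)
(cadr x)
(cr x)
(cddar x)'

# One [ad], then zero or more further [ad]s.
star=$(printf '%s\n' "$sample" | grep -E '\(c[ad][ad]*r x\)')

# One or more [ad]s: the shorter equivalent.
plus=$(printf '%s\n' "$sample" | grep -E '\(c[ad]+r x\)')

if [ "$star" = "$plus" ]; then
    printf '%s\n' "both patterns select:" "$plus"
fi
```

‘(cr x)’ is rejected by both patterns because at least one ‘a’ or ‘d’ is required.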

?

This symbol is similar to ‘*’, but the preceding expression can be matched either once or not at all. For example:

fe?d

will match ‘fed’ and ‘fd’, but nothing else.
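A quick way to confirm this is to test whole strings against the pattern. The sketch assumes GNU grep as a substitute for Gcal's matcher and uses its -x option so that the pattern must cover the entire line; the sample lines are invented.

```shell
# Invented sample lines.
sample='fed
fd
feed
fad'

# -x: the pattern must match the whole line, not just a part of it.
hits=$(printf '%s\n' "$sample" | grep -Ex 'fe?d')

printf '%s\n' "$hits"
```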

{n}

{n,}

{n,m}

One or two numbers inside braces denote an interval expression which is available in the POSIX standard. If there is one number in the braces, the preceding regexp is repeated n times. If there are two numbers separated by a comma, the preceding regexp is repeated n to m times. If there is one number followed by a comma, then the preceding regexp is repeated at least n times.

wh{3}y: matches ‘whhhy’ but not ‘why’ or ‘whhhhy’.
wh{3,5}y: matches ‘whhhy’ or ‘whhhhy’ or ‘whhhhhy’, only.
wh{2,}y: matches ‘whhy’ or ‘whhhy’, and so on.
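The three interval forms can be compared by counting how many of a set of test strings each one accepts. The sketch uses GNU grep (assumed to handle intervals as Gcal's library does) with -x for whole-line matching and -c for counting; the strings contain one to six ‘h’s.

```shell
# Test strings with 1, 2, 3, 4, 5 and 6 h's.
sample='why
whhy
whhhy
whhhhy
whhhhhy
whhhhhhy'

exact=$(printf '%s\n' "$sample" | grep -Exc 'wh{3}y')    # exactly 3 h's
range=$(printf '%s\n' "$sample" | grep -Exc 'wh{3,5}y')  # 3 to 5 h's
open=$(printf '%s\n' "$sample" | grep -Exc 'wh{2,}y')    # 2 or more h's

echo "$exact $range $open"
```

With these six strings the counts are 1, 3 and 5 respectively.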

GNU software that deals with regular expressions provides a number of additional regexp operators. These operators are described here.

Most of the additional operators are for dealing with word matching. For our purposes, a word is a sequence of one or more letters, digits, or underscores (‘_’).

\w: This operator matches any word-constituent character, i.e. any letter, digit, or underscore. Think of it as a short-hand for [A-Za-z0-9_] or [[:alnum:]_].
\W: This operator matches any character that is not word-constituent. Think of it as a short-hand for [^A-Za-z0-9_] or [^[:alnum:]_].
\<: This operator matches the empty string at the beginning of a word. For example, \<away matches ‘away’, but not ‘stowaway’.
\>: This operator matches the empty string at the end of a word. For example, stow\> matches ‘stow’, but not ‘stowaway’.
\b: This operator matches the empty string at either the beginning or the end of a word (the word boundary). For example, ‘\bballs?\b’ matches either ‘ball’ or ‘balls’ as a separate word.
\B: This operator matches the empty string within a word. In other words, ‘\B’ matches the empty string that occurs between two word-constituent characters. For example, \Brat\B matches ‘crate’, but it does not match ‘dirty rat’. ‘\B’ is essentially the opposite of ‘\b’.
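The word operators can be exercised with GNU grep -E, which supports the same GNU extensions; that it behaves exactly like Gcal's library here is an assumption. The sample lines reuse the words from the descriptions above.

```shell
# Sample lines taken from the examples in the operator descriptions.
sample='away
stowaway
stow
crate
dirty rat'

# 'away' must start a word, so 'stowaway' is rejected.
begin=$(printf '%s\n' "$sample" | grep -E '\<away')

# 'stow' must end a word, so 'stowaway' is rejected.
end=$(printf '%s\n' "$sample" | grep -E 'stow\>')

# 'rat' must sit strictly inside a word, so 'dirty rat' is rejected.
inner=$(printf '%s\n' "$sample" | grep -E '\Brat\B')

printf '%s\n' "$begin" "$end" "$inner"
```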

There are two other operators that work on buffers. In Emacs, a buffer is, naturally, an Emacs buffer. For other programs, the regexp library routines that Gcal uses consider the entire string to be matched as the buffer⁸⁰.

For Gcal, since ‘^’ and ‘$’ always work in terms of the beginning and end of strings, these operators do not add any new capabilities. They are provided for compatibility with other GNU software.

\`: This operator matches the empty string at the beginning of the buffer.
\': This operator matches the empty string at the end of the buffer.

In regular expressions, the ‘*’, ‘+’, and ‘?’ operators, as well as the braces ‘{’ and ‘}’, have the highest precedence, followed by concatenation, and finally by ‘|’. As in arithmetic, parentheses can change how operators are grouped.
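These precedence rules can be checked with a few whole-line matches. The sketch assumes GNU grep -Ex as a stand-in for Gcal's matcher, and the sample strings are invented.

```shell
# Invented sample lines.
sample='ab
cd
abd
acd
abb
abab'

match() { printf '%s\n' "$sample" | grep -Ex "$1"; }

alt=$(match 'ab|cd')        # concatenation binds tighter than '|'
grouped=$(match 'a(b|c)d')  # parentheses confine the alternation
rep=$(match 'ab+')          # '+' applies only to the preceding 'b'
rep_group=$(match '(ab)+')  # parentheses make '+' repeat 'ab'

printf '%s: %s\n' 'ab|cd' "$(echo $alt)" 'a(b|c)d' "$(echo $grouped)" \
                  'ab+' "$(echo $rep)" '(ab)+' "$(echo $rep_group)"
```

The helper function match is a hypothetical convenience defined only for this example.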

Case is normally significant in regular expressions, both when matching ordinary characters (i.e. not metacharacters), and inside character lists. Thus a ‘w’ in a regular expression matches only a lower-case ‘w’ and not an upper-case ‘W’.

The simplest way to do a case-independent match is to use a character list: ‘[Ww]’. However, this can be cumbersome if you need to use it often, and it can make the regular expressions harder to read. To meet this need, Gcal offers the --ignore-case option, which ignores all case distinctions in both the regular expression and the completely expanded text of each valid fixed date. See Fixed date option --ignore-case.
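The trade-off between a character list and a global case-insensitivity switch can be sketched with GNU grep, whose -i option is used here as an analogue of Gcal's --ignore-case (an analogy only, not the same implementation). The sample lines are invented.

```shell
# Invented sample lines.
sample='why
Why
WHY'

# The character list relaxes case for the first letter only.
list=$(printf '%s\n' "$sample" | grep -E '[Ww]hy')

# Ignoring case globally covers every letter of the pattern.
nocase=$(printf '%s\n' "$sample" | grep -Ei 'why')

printf 'character list: %s\n' "$(echo $list)"
printf 'ignore case:    %s\n' "$(echo $nocase)"
```

Matching ‘WHY’ with character lists alone would require ‘[Ww][Hh][Yy]’, which shows how quickly that approach becomes hard to read.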