Here is a complicated regexp which was formerly used by Emacs to
recognize the end of a sentence together with any whitespace that
follows. (Nowadays Emacs uses a similar but more complex default
regexp constructed by the function
See Standard Regular Expressions Used in Editing.)
Below, we show first the regexp as a string in Lisp syntax (to distinguish spaces from tab characters), and then the result of evaluating it. The string constant begins and ends with a double-quote. ‘\"’ stands for a double-quote as part of the string, ‘\\’ for a backslash as part of the string, ‘\t’ for a tab and ‘\n’ for a newline.
"[.?!]\"')}]*\\($\\| $\\|\t\\| \\)[ \t\n]*" ⇒ "[.?!]\"')}]*\\($\\| $\\| \\| \\)[ ]*"
In the output, tab and newline appear as themselves.
This regular expression contains four parts in succession and can be deciphered as follows:
The first part of the pattern is a character alternative that matches any one of three characters: period, question mark, and exclamation mark. The match must begin with one of these three characters. (This is one point where the new default regexp used by Emacs differs from the old. The new value also allows some non-ASCII characters that end a sentence without any following whitespace.)
The second part of the pattern matches any closing braces and quotation
marks, zero or more of them, that may follow the period, question mark
or exclamation mark. The
\" is Lisp syntax for a double-quote in
a string. The ‘*’ at the end indicates that the immediately
preceding regular expression (a character alternative, in this case) may be
repeated zero or more times.
\\($\\| $\\|\t\\| \\)
The third part of the pattern matches the whitespace that follows the end of a sentence: the end of a line (optionally with a space), or a tab, or two spaces. The double backslashes mark the parentheses and vertical bars as regular expression syntax; the parentheses delimit a group and the vertical bars separate alternatives. The dollar sign is used to match the end of a line.
Finally, the last part of the pattern matches any additional whitespace beyond the minimum needed to end a sentence.
rx notation (see The
rx Structured Regexp Notation), the regexp could be written
(rx (any ".?!") ; Punctuation ending sentence. (zero-or-more (any "\"')]}")) ; Closing quotes or brackets. (or line-end (seq " " line-end) "\t" " ") ; Two spaces. (zero-or-more (any "\t\n "))) ; Optional extra whitespace.
rx regexps are just S-expressions, they can be formatted
and commented as such.