
12.1 The Regular Expression for sentence-end

The symbol sentence-end is bound to the pattern that marks the end of a sentence. What should this regular expression be?

Clearly, a sentence may be ended by a period, a question mark, or an exclamation mark. Indeed, in English, only clauses that end with one of those three characters should be considered the end of a sentence. This means that the pattern should include the character set:

     [.?!]

However, we do not want forward-sentence merely to jump to a period, a question mark, or an exclamation mark, because such a character might be used in the middle of a sentence. A period, for example, is used after abbreviations. So other information is needed.

According to convention, you type two spaces after every sentence, but only one space after a period, a question mark, or an exclamation mark in the body of a sentence. So a period, a question mark, or an exclamation mark followed by two spaces is a good indicator of an end of sentence. However, in a file, the two spaces may instead be a tab or the end of a line. This means that the regular expression should include these three items as alternatives.

This group of alternatives will look like this:

     \\($\\| \\|  \\)
            ^   ^^
           TAB  SPC

Here, ‘$’ indicates the end of the line, and I have pointed out where the tab and two spaces are inserted in the expression. Both are inserted by putting the actual characters into the expression.

Two backslashes, ‘\\’, are required before the parentheses and vertical bars: the first backslash quotes the following backslash in Emacs; and the second indicates that the following character, the parenthesis or the vertical bar, is special.
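
The pieces described so far can be tried directly on strings. What follows is a minimal sketch, not code taken from Emacs: the function name my-find-sentence-gap is hypothetical, and the tab is written with the ‘\t’ escape rather than typed literally, only so that it stays visible in the listing.

     ;; The Lisp reader turns each ‘\\’ into one backslash, so the
     ;; regexp engine sees the two-character sequence ‘\(’.
     (length "\\(")
          ⇒ 2

     ;; Hypothetical helper: return the index in TEXT of the first
     ;; period, question mark, or exclamation mark that is followed
     ;; by the end of a line, a tab, or two spaces; nil if none.
     (defun my-find-sentence-gap (text)
       (string-match "[.?!]\\($\\|\t\\|  \\)" text))

     ;; The periods in "e.g." are followed by a letter or by a single
     ;; space, so the first match is the period after "this".
     (my-find-sentence-gap "See e.g. this.  Next one")
          ⇒ 13

Note that this sketch tests a string rather than a buffer; in a string, ‘$’ also matches at the very end of the string.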

Also, a sentence may be followed by any number of carriage returns, like this:

     [
     ]*

Like tabs and spaces, a carriage return is inserted into a regular expression by inserting it literally. The asterisk indicates that the <RET> is repeated zero or more times.

But a sentence end does not consist only of a period, a question mark or an exclamation mark followed by appropriate space: a closing quotation mark or a closing brace of some kind may precede the space. Indeed, more than one such mark or brace may precede the space. These require an expression that looks like this:

     []\"')}]*

In this expression, the first ‘]’ is the first character in the character set, which is how a literal ‘]’ is included in a set; the second character is ‘"’, which is preceded by a ‘\’ to tell Emacs the ‘"’ is not special, that is, that it does not end the Lisp string. The last three characters are ‘'’, ‘)’, and ‘}’.
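
To see what this character set picks up, here is a small sketch that combines it with the ‘[.?!]’ set from earlier. The function name my-sentence-punctuation is hypothetical and exists only for this illustration.

     ;; Hypothetical helper: return the first run in TEXT made of a
     ;; period, question mark, or exclamation mark plus any closing
     ;; quotation marks or braces that follow it; nil if none.
     (defun my-sentence-punctuation (text)
       (when (string-match "[.?!][]\"')}]*" text)
         (match-string 0 text)))

     ;; The exclamation mark is followed by a closing quotation
     ;; mark and a closing parenthesis; all three are matched.
     (my-sentence-punctuation "He said \"Stop!\") then left.")
          ⇒ "!\")"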

All this suggests what the regular expression pattern for matching the end of a sentence should be; and, indeed, if we evaluate sentence-end we find that it returns the following value:

     sentence-end
          ⇒ "[.?!][]\"')}]*\\($\\|     \\|  \\)[
     ]*"
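
The complete pattern can be exercised in a temporary buffer. This is only a sketch under stated assumptions: the function name my-text-after-first-sentence is hypothetical, the regular expression is copied from the value shown above rather than read from the variable, and the tab and newline are written as ‘\t’ and ‘\n’ instead of being typed literally.

     ;; Hypothetical helper: insert TEXT in a scratch buffer, search
     ;; forward for the first sentence end, and return whatever
     ;; follows it.  Return nil if no sentence end is found.
     (defun my-text-after-first-sentence (text)
       (with-temp-buffer
         (insert text)
         (goto-char (point-min))
         (when (re-search-forward
                "[.?!][]\"')}]*\\($\\|\t\\|  \\)[\n]*" nil t)
           (buffer-substring (point) (point-max)))))

     ;; The period in "Dr." is followed by a single space, so it is
     ;; skipped; the search stops after the two spaces.
     (my-text-after-first-sentence "Dr. Smith arrived.  He sat down.")
          ⇒ "He sat down."

     ;; Here the question mark is at the end of a line, and the
     ;; trailing ‘[\n]*’ consumes both newlines.
     (my-text-after-first-sentence "Is it so?\n\nYes.")
          ⇒ "Yes."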

(Well, not in GNU Emacs 22; that is because of an effort to make the process simpler and to handle more glyphs and languages. When the value of the variable sentence-end is nil, Emacs instead uses the value returned by the function sentence-end. (Here is a use of the difference between a value and a function in Emacs Lisp.) The function returns a value constructed from the variables sentence-end-base, sentence-end-double-space, sentence-end-without-period, and sentence-end-without-space. The critical variable is sentence-end-base; its global value is similar to the one described above, but it also contains two additional quotation marks with differing degrees of curliness. The sentence-end-without-period variable, when true, tells Emacs that a sentence may end without a period, as in Thai text.)
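
As a rough illustration of the function sentence-end, the following sketch binds the variable sentence-end to nil, so that the function has to construct a value, and then tries the result on a string. The function name my-sentence-end-index is hypothetical. The results shown assume a GNU Emacs that provides the function sentence-end and that the other variables it consults, such as sentence-end-base, still have their default values.

     ;; Hypothetical helper: return the index in TEXT where the
     ;; regexp built by the function ‘sentence-end’ first matches,
     ;; with ‘sentence-end-double-space’ bound to DOUBLE-SPACE.
     (defun my-sentence-end-index (text double-space)
       (let ((sentence-end nil)          ; make the function build a value
             (sentence-end-double-space double-space))
         (string-match (sentence-end) text)))

     ;; With two spaces required, a period followed by one space
     ;; does not end a sentence.
     (my-sentence-end-index "A. B" t)
          ⇒ nil
     (my-sentence-end-index "A.  B" t)
          ⇒ 1

     ;; With that requirement turned off, a single space suffices.
     (my-sentence-end-index "A. B" nil)
          ⇒ 1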