3.7 Problematic Regular Expressions
Some strings are invalid regular expressions and cause grep to issue a
diagnostic and fail. For example, ‘xy\1’ is invalid because there is no
parenthesized subexpression for the back-reference ‘\1’ to refer to.
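As a minimal sketch of what "issue a diagnostic and fail" looks like from a script, the following assumes the conventional grep exit statuses (0 for a match, 1 for no match, 2 for an error). The helper name regex_status and the sample input line are hypothetical and chosen only for illustration.

```shell
# Hypothetical helper: print grep's exit status for one pattern.
# 0 = a line matched, 1 = no line matched, 2 = an error such as an invalid regex.
regex_status() {
  status=0
  printf 'xyx\n' | grep -e "$1" >/dev/null 2>&1 || status=$?
  echo "$status"
}

regex_status 'xy'     # valid pattern that matches the sample line: prints 0
regex_status 'xy\1'   # back-reference with no group to refer to: prints 2
```

Checking for a status greater than 1 lets a script tell a rejected pattern apart from a pattern that simply found nothing.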
Also, some regular expressions have unspecified behavior and should be
avoided even if grep does not currently diagnose them. For example,
‘xy\0’ has unspecified behavior because ‘0’ is not a special character
and ‘\0’ is not a special backslash expression (see Special Backslash
Expressions).
Unspecified behavior can be particularly problematic because the set
of matched strings might be only partially specified, or not be
specified at all, or the expression might even be invalid.
The following regular expression constructs are invalid on all
platforms conforming to POSIX, so portable scripts can assume that
grep rejects these constructs:
- A basic regular expression containing a back-reference ‘\n’
preceded by fewer than n closing parentheses. For example,
‘\(a\)\2’ is invalid.
- A bracket expression containing ‘[:’ that does not start a
character class; and similarly for ‘[=’ and ‘[.’. For
example, ‘[a[:b]’ and ‘[a[:ouch:]b]’ are invalid.
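The three example patterns from this list can be tried directly. This is only a sketch: the helper name bre_status and the one-line input are hypothetical, and it assumes that grep reports a rejected pattern with exit status 2.

```shell
# Hypothetical helper: print grep's exit status for a basic regular expression.
bre_status() {
  status=0
  printf 'a\n' | grep -e "$1" >/dev/null 2>&1 || status=$?
  echo "$status"
}

# Each pattern below is one of the invalid examples named in the list.
for pattern in '\(a\)\2' '[a[:b]' '[a[:ouch:]b]'; do
  printf '%-14s exit status %s\n' "$pattern" "$(bre_status "$pattern")"
done
```

On a conforming grep, each line should report an error status rather than 0 or 1.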
GNU grep treats the following constructs as invalid. However, other
grep implementations might allow them, so portable scripts should not
rely on their being invalid:
- Unescaped ‘\’ at the end of a regular expression.
- Unescaped ‘[’ that does not start a bracket expression.
- A ‘\{’ in a basic regular expression that does not start an
interval expression.
- A basic regular expression with unbalanced ‘\(’ or ‘\)’,
or an extended regular expression with unbalanced ‘(’.
- In the POSIX locale, a range expression like ‘z-a’ that
represents zero elements. A non-GNU grep might treat it as
a valid range that never matches.
- An interval expression with a repetition count greater than 32767.
(The portable POSIX limit is 255, and even interval expressions with
smaller counts can be impractically slow on all known implementations.)
- A bracket expression that contains at least three elements, the first
and last of which are both ‘:’, or both ‘.’, or both
‘=’. For example, a non-GNU grep might treat
‘[:alpha:]’ like ‘[[:alpha:]]’, or like ‘[:ahlp]’.
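A few of the constructs in this list can be probed in the same way. The sketch below assumes GNU grep; the helper name gnu_status and the sample input are hypothetical. Other implementations might print 0 or 1 for some of these patterns, which is exactly why scripts should not depend on the result.

```shell
# Hypothetical helper: print grep's exit status for a basic regular expression.
gnu_status() {
  status=0
  printf 'a\n' | grep -e "$1" >/dev/null 2>&1 || status=$?
  echo "$status"
}

# Trailing backslash, stray '[', unbalanced '\(', and '[:alpha:]' without
# the outer brackets. GNU grep is expected to reject each with status 2.
for pattern in 'a\' '[' '\(a' '[:alpha:]'; do
  printf '%-10s exit status %s\n' "$pattern" "$(gnu_status "$pattern")"
done
```

A script that needs one of these characters literally is safer escaping it or placing it in a bracket expression than relying on either acceptance or rejection.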
The following constructs have well-defined behavior in GNU grep.
However, they have unspecified behavior elsewhere, so
portable scripts should avoid them:
- Special backslash expressions like ‘\b’, ‘\<’, and ‘\]’.
See Special Backslash Expressions.
- A basic regular expression that uses ‘\?’, ‘\+’, or ‘\|’.
- An extended regular expression that uses back-references.
- An empty regular expression, subexpression, or alternative. For
example, ‘(a|bc|)’ is not portable; a portable equivalent is
‘(a|bc)?’.
- In a basic regular expression, an anchoring ‘^’ that appears
directly after ‘\(’, or an anchoring ‘$’ that appears
directly before ‘\)’.
- In a basic regular expression, a repetition operator that
directly follows another repetition operator.
- In an extended regular expression, unescaped ‘{’
that does not begin a valid interval expression.
GNU grep treats the ‘{’ as an ordinary character.
- A null character or an encoding error in either pattern or input data.
See Character Encoding.
- An input file that ends in a non-newline character,
where GNU grep silently supplies a newline.
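Two items from this list have straightforward portable rewrites, sketched here under the assumption that GNU grep is the implementation being run. The sample input lines and the function name sample are hypothetical; the interval form ‘\{1,\}’ is offered as one portable substitute for ‘\+’, and ‘(a|bc)?’ is the rewrite given in the list.

```shell
# Hypothetical sample input for the basic regular expression comparison.
sample() { printf 'ac\nabc\nabbc\n'; }

# GNU extension in a basic regular expression: \+ means "one or more".
sample | grep -c 'ab\+c'
# Portable equivalent using a POSIX interval expression.
sample | grep -c 'ab\{1,\}c'

# An empty alternative, then the portable rewrite; -x requires a whole-line match.
printf 'a\nbc\n\nx\n' | grep -E -x -c '(a|bc|)'
printf 'a\nbc\n\nx\n' | grep -E -x -c '(a|bc)?'
```

With GNU grep, both commands in each pair print the same count, so switching to the portable form should not change results there.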
The following constructs have unspecified behavior, in both GNU
and other grep implementations. Scripts should avoid
them whenever possible.
- A backslash escaping an ordinary character, unless it is a
back-reference like ‘\1’ or a special backslash expression like
‘\<’ or ‘\b’. See Special Backslash Expressions. For
example, ‘\x’ has unspecified behavior now, and a future version
of grep might specify ‘\x’ to have a new behavior.
- A repetition operator that appears directly after an anchor, or at the
start of a complete regular expression, parenthesized subexpression,
or alternative. For example, ‘+|^*(+a|?-b)’ has unspecified
behavior, whereas ‘\+|^\*(\+a|\?-b)’ is portable.
- A range expression outside the POSIX locale. For example, in some
locales ‘[a-z]’ might match some characters that are not
lowercase letters, or might not match some lowercase letters, or might
be invalid. With GNU grep it is not documented whether
these range expressions use native code points, or use the collating
sequence specified by the LC_COLLATE category, or have some
other interpretation. Outside the POSIX locale, it is portable to use
‘[[:lower:]]’ to match a lowercase letter, or
‘[abcdefghijklmnopqrstuvwxyz]’ to match an ASCII lowercase
letter.
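The portable alternatives named in this list can be exercised as follows. This is a sketch only: the input lines and the function name sample are hypothetical, and the counts depend on that made-up input rather than on anything in the list itself.

```shell
# Hypothetical sample input: one lowercase letter, one uppercase letter, one digit.
sample() { printf 'a\nB\n1\n'; }

# Portable ways to match a lowercase letter without a range like [a-z].
sample | grep -c '[[:lower:]]'
sample | grep -c '[abcdefghijklmnopqrstuvwxyz]'

# The portable pattern from the list, with every repetition operator escaped
# so that '+', '*' and '?' are matched literally.
printf '%s\n' '*?-b' 'x+y' 'plain' | grep -E -c '\+|^\*(\+a|\?-b)'
```

Here the first two commands count only the line ‘a’, and the last command counts the two lines that contain a literal ‘+’ or begin with ‘*?-b’.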