Character Encoding (GNU Grep 3.12)

3.8 Character Encoding

The LC_CTYPE locale specifies the encoding of characters in patterns and data, that is, whether text is encoded in UTF-8, ASCII, or some other encoding. See Environment Variables.
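
As a rough illustration of how the locale is chosen, the sketch below prints the encoding implied by the current settings and then overrides the locale for a single grep invocation. It assumes a POSIX-style `locale` utility is installed; the sample input is made up for the example.

```shell
# Print the name of the character encoding (codeset) that the
# current locale settings select. The exact name is system-dependent.
locale charmap

# Override the locale for one grep invocation only.
# LC_ALL takes precedence over LC_CTYPE, which takes precedence over LANG.
printf 'hello\n' | LC_ALL=C grep -c 'hello'
```

Setting LC_ALL on the command line affects only that one command, which is usually the least surprising way to pin the encoding in a script.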

In the ‘C’ or ‘POSIX’ locale, every character is encoded as a single byte and every byte is a valid character. In more-complex encodings such as UTF-8, a sequence of multiple bytes may be needed to represent a character, and some bytes may be encoding errors that do not contribute to the representation of any character. POSIX does not specify the behavior of grep when patterns or input data contain encoding errors or null characters, so portable scripts should avoid such usage. As an extension to POSIX, GNU grep treats null characters like any other character. However, unless the -a (--binary-files=text) option is used, the presence of null characters in input or of encoding errors in output causes GNU grep to treat the file as binary and suppress details about matches. See File and Directory Selection.
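
The following sketch shows both effects described above: the same bytes counted as two characters or as one depending on the locale, and the way a null character triggers binary-file handling unless -a is given. The locale name C.UTF-8 is an assumption and is not available on every system; `locale -a` lists the names that are. The sample inputs are made up for the example.

```shell
# 'é' is the two-byte UTF-8 sequence 0xC3 0xA9 (octal \303 \251).
# C locale: every byte is a character, so two dots are needed after "caf".
printf 'caf\303\251\n' | LC_ALL=C grep -c '^caf..$'

# UTF-8 locale: the two bytes form a single character, so one dot suffices.
# C.UTF-8 is an assumed locale name; substitute one that `locale -a` lists.
printf 'caf\303\251\n' | LC_ALL=C.UTF-8 grep -c '^caf.$' ||
  echo 'no match: is the UTF-8 locale installed?'

# Input containing a null character: the line matches, but grep reports
# only that a binary file matched instead of printing the line.
printf 'foo\0bar\n' | LC_ALL=C grep 'foo'

# With -a the input is treated as text and the matching line is printed;
# od -c makes the embedded null character visible as \0.
printf 'foo\0bar\n' | LC_ALL=C grep -a 'foo' | od -c
```

If the named UTF-8 locale is missing, grep typically falls back to the ‘C’ locale, in which case the second command finds no match.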

Regardless of locale, the 103 characters in the POSIX Portable Character Set (a subset of ASCII) are always encoded as a single byte, and the 128 ASCII characters have their usual single-byte encodings on all but oddball platforms.