Information for GNU grep developers

1 Generic GNU information

A good start is to read the GNU coding standards and the Information for maintainers of GNU software.

2 Mailing lists

GNU grep's mailing lists are hosted on lists.gnu.org.

2.1 The `bug-grep` mailing list

To report bugs, suggest features, ask questions, or help in the development of GNU grep, please send email to the bug-grep mailing list. You can attach bug fixes and patches to your email. To save time, you may want to first look at GNU grep's bug report log to see whether the bug has already been reported. If you see, for example, that Bug#16979 is similar to the symptoms you observe, you can follow up to that bug report by sending email to <16979@debbugs.gnu.org>.

Before you contribute significant changes to GNU grep, the Free Software Foundation (FSF) requires that you sign copyright assignment papers. If you have not already done so and are unwilling or unable to, it may be better to just describe bugs or proposed features rather than post actual code (or documentation), as that material would have to be rewritten anyway.

2.2 The `grep-commit` mailing list

The grep-commit read-only mailing list tracks all changes made to GNU grep.

2.3 Other deprecated mailing lists

Older GNU grep releases directed users to the bug-gnu-utils mailing list. As a consequence, some still post their bug reports and questions there. For this reason, it is a good idea for GNU grep developers to monitor this mailing list and follow up on related threads started there by redirecting them to the bug-grep mailing list. New threads about GNU grep should not be intentionally started there.

3 Project page on Savannah

The Savannah project page for GNU grep features development-related tools.

4 Git repository

4.1 Source code

See the Savannah web page about the Git repository for GNU grep's source code.

4.2 Web site

See the Savannah web page about the CVS repository for GNU grep's web pages.

4.3 Tools

Developers with write access to the repositories will need to create an account on Savannah and upload their SSH public identity information there.

6 Release procedure

A number of tasks must be performed before every release. See README-release.

6.1 Source code compatibility with GNU awk

Drop dfa.[ch] into a copy of gawk and run “make check”. This step will soon be obsolete: we're syncing the two dfa.c files.

7 To do

7.1 Other implementations

See this list of grep implementations.

Take a look at these and consider opportunities for merging or cloning:

ja-grep's mlb2 patch (Japanese grep);
lgrep (from lv, a Powerful Multilingual File Viewer / Grep);
pcregrep (from the Perl-Compatible Regular Expressions [PCRE] library);
cgrep (Context grep) seems like nice work;
sgrep (Struct grep);
agrep (Approximate grep), from glimpse;
nr-grep (Nondeterministic reverse grep);
ggrep (Grouse grep);
grep.py (Python grep);
freegrep (a BSD-licensed grep for those who can't stand the GNU GPL).

7.2 POSIX

In general, interesting things to check in POSIX/OpenGroup include:

Provide support for the POSIX [= =] and [. .] constructs. This is difficult because it requires locale-dependent details of the character set and collating sequence, but POSIX does not standardize any method for accessing this information!
Move away from the GNU regex API to the POSIX regex API.

7.2.1 POSIX and `--ignore-case`

For this issue, interesting things to check in POSIX include:

Volume “Base Definitions (XBD)”, Chapter “Regular Expressions” and in particular Section “Regular Expression General Requirements” and its paragraph about caseless matching (note that this may not have been fully thought through and that this text may be self-contradicting [specifically: “of either data or patterns” versus all the rest]).

In particular, consider the following with POSIX's approach to case folding in mind. Assume a non-Turkic locale with a character repertoire reduced to the following forms of “LATIN LETTER I”:

0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049

First note the differing UTF-8 octet lengths of U+0049 (0x49) and U+0069 (0x69) versus U+0130 (0xC4 0xB0) and U+0131 (0xC4 0xB1). This implies that whole UTF-8 strings cannot be case-converted in place, using the same memory buffer, and that the needed octet-size of the new buffer cannot merely be guessed.
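
The octet lengths are easy to check with a short sketch. Python is used here purely for illustration; the `LC` table is a hand-written copy of the simple lower-case mappings from the UnicodeData entries above, and `utf8_len` is a hypothetical helper, not part of any library.

```python
# Simple lower-case mappings, hand-copied from the UnicodeData entries above
# (an illustrative table, not a library API).
LC = {
    "\u0049": "\u0069",  # I -> i
    "\u0069": "\u0069",  # i -> i
    "\u0130": "\u0069",  # I WITH DOT ABOVE -> i
    "\u0131": "\u0131",  # DOTLESS I -> DOTLESS I
}


def utf8_len(ch):
    """Number of octets in the UTF-8 encoding of a single character."""
    return len(ch.encode("utf-8"))


for ch, lower in LC.items():
    print(f"U+{ord(ch):04X}: {utf8_len(ch)} octet(s) -> "
          f"U+{ord(lower):04X}: {utf8_len(lower)} octet(s)")
```

Lower-casing U+0130 turns a two-octet sequence into a one-octet one, so the converted string does not necessarily have the same octet length as the original.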

We have

lc(I) = i, uc(I) = I
lc(i) = i, uc(i) = I
lc(İ) = i, uc(İ) = İ
lc(ı) = ı, uc(ı) = I

where lc() and uc() denote lower-case and upper-case conversions.

There are several candidate --ignore-case logics (including the one mandated by POSIX):

Using the

if (lc(input_wchar) == lc(pattern_wchar))

logic leads to the following matches:

  \in  I  i  İ  ı
pat\   ----------
"I" |  Y  Y  Y  n
"i" |  Y  Y  Y  n
"İ" |  Y  Y  Y  n
"ı" |  n  n  n  Y

There is a lack of symmetry between CAPITAL and SMALL LETTERs with this.

Using the

if (uc(input_wchar) == uc(pattern_wchar))

logic leads to the following matches:

  \in  I  i  İ  ı
pat\   ----------
"I" |  Y  Y  n  Y
"i" |  Y  Y  n  Y
"İ" |  n  n  Y  n
"ı" |  Y  Y  n  Y

There is a lack of symmetry between CAPITAL and SMALL LETTERs with this.

Using the
```
if (   lc(input_wchar) == lc(pattern_wchar)
    || uc(input_wchar) == uc(pattern_wchar))
```
logic leads to the following matches:
```
  \in  I  i  İ  ı
pat\   ----------
"I" |  Y  Y  Y  Y
"i" |  Y  Y  Y  Y
"İ" |  Y  Y  Y  n
"ı" |  Y  Y  n  Y
```
There is some elegance and symmetry with this. But there are potentially two conversions to be made per input character. If the pattern is pre-converted, two copies of it need to be kept and used in a mutually coherent fashion.

Using the
```
if (      input_wchar  == pattern_wchar
    || lc(input_wchar) == pattern_wchar
    || uc(input_wchar) == pattern_wchar)
```
logic (as mandated by POSIX) leads to the following matches:
```
  \in  I  i  İ  ı
pat\   ----------
"I" |  Y  Y  n  Y
"i" |  Y  Y  Y  n
"İ" |  n  n  Y  n
"ı" |  n  n  n  Y
```
There is a different CAPITAL/SMALL symmetry with this. But there's also a loss of pattern/input symmetry that's unique to it. Also there are potentially two conversions to be made per input character.

Using the
```
if (lc(uc(input_wchar)) == lc(uc(pattern_wchar)))
```
logic leads to the following matches:
```
  \in  I  i  İ  ı
pat\   ----------
"I" |  Y  Y  Y  Y
"i" |  Y  Y  Y  Y
"İ" |  Y  Y  Y  Y
"ı" |  Y  Y  Y  Y
```
This shows total symmetry and transitivity (at least in this example analysis). There are two conversions to be made per input character, but support could be added for having a single straight mapping performing a composition of the two conversions.

Any optimization in the implementation of each logic must not change its basic semantics.
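
The five match tables above can be regenerated mechanically, which may help when checking an optimized implementation against the intended semantics. The following is a minimal sketch, assuming the hand-copied `LC` and `UC` tables stand in for the locale's simple case mappings; the labels in `LOGICS` are made up for this example.

```python
# Simple case mappings for the four-character repertoire above
# (hand-copied from the lc()/uc() list; illustrative only).
CHARS = ["I", "i", "\u0130", "\u0131"]  # I, i, I WITH DOT ABOVE, DOTLESS I
LC = {"I": "i", "i": "i", "\u0130": "i", "\u0131": "\u0131"}
UC = {"I": "I", "i": "I", "\u0130": "\u0130", "\u0131": "I"}


def lc(c):
    return LC[c]


def uc(c):
    return UC[c]


# Each candidate logic takes (input_wchar, pattern_wchar).
# The dictionary keys are hypothetical labels, not established names.
LOGICS = {
    "lc": lambda i, p: lc(i) == lc(p),
    "uc": lambda i, p: uc(i) == uc(p),
    "lc-or-uc": lambda i, p: lc(i) == lc(p) or uc(i) == uc(p),
    "posix": lambda i, p: i == p or lc(i) == p or uc(i) == p,
    "lc-of-uc": lambda i, p: lc(uc(i)) == lc(uc(p)),
}


def match_table(logic):
    """One string per pattern row; columns are inputs in CHARS order."""
    return ["".join("Y" if logic(i, p) else "n" for i in CHARS)
            for p in CHARS]


for name, logic in LOGICS.items():
    print(name, match_table(logic))
```

Each printed row corresponds to one pattern line of the matching table, read left to right over the inputs I, i, İ, ı.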

7.3 Unicode

In general, interesting things to check in Unicode include:

Unicode Technical Standard #18 (“Unicode Regular Expressions”).
Unicode Standard Annex #15 (“Unicode Normalization Forms”).

7.3.1 Unicode and `--ignore-case`

For this issue, interesting things to check in Unicode include:

The Unicode Standard, Chapter 3 (“Conformance”), Section 3.13 (“Default Case Operations”) and the toCasefold() case conversion operation.
The Unicode Standard, Chapter 4 (“Character Properties”), Section 4.2 (“Case—Normative”) and the SpecialCasing.txt and CaseFolding.txt files from the Unicode Character Database.
The Unicode Standard, Chapter 5 (“Implementation Guidelines”), Section 5.18 (“Case Mappings”), Subsection “Caseless Matching”.
The Unicode case charts.

Unicode uses the

if (toCasefold(input_wchar_string) == toCasefold(pattern_wchar_string))

logic for caseless matching. Let's consider the “LATIN LETTER I” example mentioned above. In a non-Turkic locale, simple case folding yields

toCasefold_simple(U+0049) = U+0069
toCasefold_simple(U+0069) = U+0069
toCasefold_simple(U+0130) = U+0130
toCasefold_simple(U+0131) = U+0131

which leads to the following matches:

  \in  I  i  İ  ı
pat\   ----------
"I" |  Y  Y  n  n
"i" |  Y  Y  n  n
"İ" |  n  n  Y  n
"ı" |  n  n  n  Y

This is different from anything so far!

In a non-Turkic locale, full case folding yields

toCasefold_full(U+0049) = U+0069
toCasefold_full(U+0069) = U+0069
toCasefold_full(U+0130) = <U+0069, U+0307>
toCasefold_full(U+0131) = U+0131

with

0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;;

which leads to the following matches:

  \in  I  i  İ  ı
pat\   ----------
"I" |  Y  Y  *  n
"i" |  Y  Y  *  n
"İ" |  n  n  Y  n
"ı" |  n  n  n  Y

This is just sad!

Note that having toCasefold(U+0131), simple or full, map to itself instead of U+0069 is in contradiction with the rules of Section 5.18 of the Unicode Standard since toUpperCase(U+0131) is U+0049. The same holds for toCasefold_simple(U+0130), since toLowerCase(U+0130) is U+0069. The justification for the weird toCasefold_full(U+0130) mapping is unknown; it doesn't even make sense to add a dot (U+0307) to a letter that already has one (U+0069). It would have been so simple to put them all in the same equivalence class!
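
Both folding tables can be reproduced with a small sketch. It assumes that Python's `str.casefold()`, which applies Unicode full case folding, is an acceptable stand-in for toCasefold_full(); Python has no built-in simple folding, so `SIMPLE_FOLD` is a hand-copied table. The `*` cells are read here as partial matches (the folded pattern equals only a proper prefix of the folded input), which is an interpretation of the table rather than something the text states.

```python
CHARS = ["\u0049", "\u0069", "\u0130", "\u0131"]  # I, i, I WITH DOT ABOVE, DOTLESS I

# Simple case folding, hand-copied from the list above (illustrative table).
SIMPLE_FOLD = {
    "\u0049": "\u0069",
    "\u0069": "\u0069",
    "\u0130": "\u0130",
    "\u0131": "\u0131",
}


def fold_simple(s):
    return "".join(SIMPLE_FOLD.get(c, c) for c in s)


def fold_full(s):
    # str.casefold() performs locale-independent full case folding.
    return s.casefold()


def cell(fold, inp, pat):
    """'Y' if the folded strings are equal, '*' if the folded pattern is
    only a proper prefix of the folded input, 'n' otherwise."""
    folded_inp, folded_pat = fold(inp), fold(pat)
    if folded_inp == folded_pat:
        return "Y"
    return "*" if folded_inp.startswith(folded_pat) else "n"


def match_table(fold):
    """One string per pattern row; columns are inputs in CHARS order."""
    return ["".join(cell(fold, i, p) for i in CHARS) for p in CHARS]


print(match_table(fold_simple))
print(match_table(fold_full))
```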

Also consider the following problem with Unicode's approach to case folding in mind. Assume that we want to perform

echo 'AßBC' | grep -i 'Sb'

which corresponds to

input:    U+0041 U+00DF U+0042 U+0043 U+000A
pattern:  U+0053 U+0062

Following “CaseFolding-4.1.0.txt”, applying the toCasefold() transformation to these yields

input:    U+0061 U+0073 U+0073 U+0062 U+0063 U+000A
pattern:                U+0073 U+0062

so, according to this approach, the input should match the pattern. As long as the original input line is to be reported to the user as a whole, there is no problem from the user's point of view, although the implementation is complicated by this.

However, consider both these GNU extensions:

echo 'AßBC' | grep -i --only-matching 'Sb'
echo 'AßBC' | grep -i --color=always  'Sb'

What is to be reported in these cases, since the match begins in the middle of the original input character 'ß'?
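
The difficulty can be made concrete by folding the input one character at a time while remembering which original character each folded code point came from. This is only a sketch: `fold_with_origins` and `find_caseless` are hypothetical helpers, Python's `str.casefold()` stands in for toCasefold(), and nothing here reflects how GNU grep handles the case.

```python
def fold_with_origins(text):
    """Case-fold text character by character, recording for every folded
    code point the index of the original character that produced it."""
    folded, origins = [], []
    for index, ch in enumerate(text):
        for out in ch.casefold():  # full folding may yield several code points
            folded.append(out)
            origins.append(index)
    return "".join(folded), origins


def find_caseless(text, pattern):
    """Return (first, last, clean) for the first caseless match: the indices
    of the first and last original characters it touches, and whether it
    starts and ends on original-character boundaries. None if no match."""
    folded, origins = fold_with_origins(text)
    needle = pattern.casefold()
    if not needle:
        return None
    start = folded.find(needle)
    if start < 0:
        return None
    end = start + len(needle)  # exclusive, in folded coordinates
    starts_clean = start == 0 or origins[start - 1] != origins[start]
    ends_clean = end == len(folded) or origins[end - 1] != origins[end]
    return origins[start], origins[end - 1], starts_clean and ends_clean


print(find_caseless("AßBC\n", "Sb"))  # (1, 2, False)
```

The `False` flags that the match begins inside the expansion of 'ß' (original index 1), so there is no exact substring of the original line that corresponds to it.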

Note that Unicode's toCasefold() cannot be implemented in terms of POSIX' towctrans() since that can only return a single wint_t value per input wint_t value.

7.4 Miscellaneous

Check FreeBSD's integration of zgrep (-Z) and bzgrep (-J) in one binary. Is there a possibility of doing even better by automatically checking the magic of binary files ourselves (0x1F 0x8B for gzip, 0x1F 0x9D for compress, and 0x42 0x5A 0x68 for bzip2)?
Lazy dynamic linking of libpcre, libz, and libbz2?
Texinfo documentation: Info documents are also supposed to contain a tutorial and examples.
Fix the DFA matcher to never use exponential space. (Fortunately, these cases are rare.)
Improve the performance of the regex backtracking matcher. This matcher is agonizingly slow and is responsible for grep sometimes being slower than UNIX grep when backreferences are used.
Some tests in tests/spencer2.tests should have failed! Need to filter out some bugs in dfa.[ch]/regex.[ch].
Threads for grep?
GNU grep does 32-bit arithmetic; it needs to move to 64-bit.
Clean up, too many #ifdefs!
Check some new algorithms for matching; talk to Karl Berry and Nelson. Sunday claims that his "Quick Search" algorithm (CACM 33(8), August 1990, pp. 132–142) is faster than Boyer-Moore. Worth checking.
Better and faster!
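
As an illustration of the magic-number idea in the first item, the check itself is small. The sketch below uses only the byte sequences quoted there; `sniff_compression` and `sniff_file` are hypothetical names, and the standard `gzip` and `bz2` modules are used only to produce sample data.

```python
import bz2
import gzip

# Magic numbers quoted in the to-do item above.
MAGIC = [
    (b"\x1f\x8b", "gzip"),
    (b"\x1f\x9d", "compress"),
    (b"\x42\x5a\x68", "bzip2"),  # ASCII "BZh"
]


def sniff_compression(header):
    """Return the name of the format whose magic number prefixes header,
    or None if none of the known magic numbers match."""
    for magic, name in MAGIC:
        if header.startswith(magic):
            return name
    return None


def sniff_file(path):
    """Read only as many leading bytes as the longest magic number."""
    longest = max(len(magic) for magic, _ in MAGIC)
    with open(path, "rb") as stream:
        return sniff_compression(stream.read(longest))


print(sniff_compression(gzip.compress(b"hello")))
print(sniff_compression(bz2.compress(b"hello")))
print(sniff_compression(b"plain text"))
```

A real implementation would also have to decide what to do with input that merely happens to begin with one of these sequences.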
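
For reference when evaluating the "Quick Search" item, here is a minimal sketch of the algorithm as it is commonly described: after a failed comparison, the window is shifted according to the character just past its end. It makes no claim about performance relative to Boyer-Moore and is unrelated to GNU grep's own matcher code; `quick_search` is a name chosen for this example.

```python
def quick_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Shift table: distance from a character's last occurrence in the
    # pattern to one position past the end of the pattern.
    shift = {ch: m - i for i, ch in enumerate(pattern)}
    pos = 0
    while pos + m <= n:
        if text[pos:pos + m] == pattern:
            return pos
        if pos + m == n:
            break  # no character follows the window
        # Characters absent from the pattern allow the maximal shift, m + 1.
        pos += shift.get(text[pos + m], m + 1)
    return -1


print(quick_search("the quick brown fox", "brown"))  # 10
```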

8 Distributors

The purpose of this listing is to help GNU grep maintainers track down bug fixes and improvements made by distributors so they can be integrated back into the upstream releases from GNU, if appropriate.

Users should not use this listing to find an alternative place to send their bug reports. These are still best sent upstream, to the GNU grep team, through the bug-grep@gnu.org mailing list or the GNU grep project page on Savannah.

This listing is not exhaustive; priority is given to listing distributors who actually maintain patches to the upstream package from GNU.

Please keep this listing sorted by entry. Each field type may appear more than once if appropriate, the field order being significant.

Debian GNU/Linux

Web site	http://www.debian.org/
Package database entry	Old stable http://packages.debian.org/oldstable/base/grep
Maintainer	Robert van der Meulen `<rvdm at debian.org>`
Package database entry	Stable http://packages.debian.org/stable/base/grep
Maintainer	Ryan M. Golbeck `<rmgolbeck at debian.org>`
Maintainer	Jeff Bailey `<jbailey at nisa.net>`
Package database entry	Testing http://packages.debian.org/testing/base/grep
Package database entry	Unstable http://packages.debian.org/unstable/base/grep
Maintainer	Anibal Monsalve Salazar `<anibal at debian.org>`
Maintainer	Santiago Ruano Rincon `<santiago at unicauca.edu.co>`
Bug tracking	http://bugs.debian.org/grep
Source package name	grep
Binary package name	grep
Entry updated	2005-11-08

Fedora Core/Red Hat

Web site	http://fedora.redhat.com/
Web site	http://www.redhat.com/
Maintainer	Tim Waugh `<twaugh at redhat.com>`
Bug tracking	Red Hat Bugzilla http://bugzilla.redhat.com/
Managed repository	`cvs -d:pserver:anonymous@cvs.fedora.redhat.com:/cvs/dist co devel/grep`
Managed repository	http://cvs.fedora.redhat.com/viewcvs/devel/grep/
Source package name	grep
Binary package name	grep
Entry updated	2005-05-05

FreeBSD

Web site	http://www.freebsd.org/
Bug tracking	http://www.freebsd.org/cgi/query-pr-summary.cgi?query
Managed repository	`CVS_RSH=ssh cvs -d:ext:freebsdanoncvs@anoncvs.FreeBSD.org:/home/ncvs co src/gnu/usr.bin/grep`
Managed repository	http://www.freebsd.org/cgi/cvsweb.cgi/src/gnu/usr.bin/grep/
Entry updated	2005-05-05

Gentoo Linux

Web site	http://www.gentoo.org/
Package database entry	http://packages.gentoo.org/packages/?category=sys-apps;name=grep
Bug tracking	Gentoo Bugzilla http://bugs.gentoo.org/
Managed repository	http://www.gentoo.org/cgi-bin/viewcvs.cgi/sys-apps/grep/
Source package name	grep
Binary package name	grep
Entry updated	2005-05-05

Mandriva Linux

Web site	http://www.mandrivalinux.com/
Bug tracking	Mandriva Bugzilla http://qa.mandriva.com/
Source package name	grep
Binary package name	grep
Entry updated	2005-05-05

NetBSD

Web site	http://www.netbsd.org/
Package database entry	ftp://ftp.netbsd.org/pub/NetBSD/packages/pkgsrc/textproc/grep/README.html
Bug tracking	http://www.netbsd.org/Misc/query-pr.html
Managed repository	`cvs -d:pserver:anoncvs@anoncvs.NetBSD.org:/cvsroot co pkgsrc/textproc/grep`
Managed repository	http://cvsweb.netbsd.org/bsdweb.cgi/pkgsrc/textproc/grep/
Source package name	grep
Binary package name	grep
Entry updated	2005-05-05

OpenBSD

Web site	http://www.openbsd.org/
Package database entry	http://www.openbsd.org/3.8_packages/i386/ggrep-2.5.1p1.tgz-long.html
Maintainer	Christian Weisgerber `<naddy at openbsd.org>`
Bug tracking	http://www.openbsd.org/query-pr.html
Managed repository	`cvs -d:pserver:anoncvs@anoncvs1.ca.openbsd.org:/cvs co ports/sysutils/ggrep`
Managed repository	http://www.openbsd.org/cgi-bin/cvsweb/ports/sysutils/ggrep/
Source package name	ggrep
Binary package name	ggrep
Entry updated	2005-11-08

OpenPKG

Web site	http://www.openpkg.org/
Maintainer	Ralf S. Engelschall `<rse at openpkg.org>`
Managed repository	`cvs -d :pserver:anonymous@cvs.openpkg.org:/v/openpkg/cvs co openpkg-src/grep`
Managed repository	`rsync -av rsync://rsync.openpkg.org/openpkg-cvs/openpkg-src/grep/ .`
Managed repository	http://cvs.openpkg.org/dir?d=openpkg-src/grep
Source package name	grep
Binary package name	grep
Entry updated	2005-06-19

SuSE Linux

Web site	http://www.novell.com/linux/suse/
Maintainer	Andreas Schwab `<schwab at suse.de>`
Package database entry	Professional http://www.novell.com/products/linuxpackages/professional/grep.html
Source package name	grep
Binary package name	grep
Entry updated	2005-06-19

Copyright © 2005, 2015 Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA
Verbatim copying and distribution of this entire article are permitted worldwide, without royalty, in any medium, provided this notice and the copyright notice are preserved.

Updated: $Date: 2015/03/07 00:30:38 $ (UTC) by $Author: eggert $ (at savannah.gnu.org)