GNU Smalltalk User’s Guide: Regular expressions

Regular expressions, or "regexes", are a sophisticated way to efficiently match patterns of text. If you are unfamiliar with regular expressions in general, see section 20.5, Syntax of Regular Expressions, in the GNU Emacs Manual, which serves as a guide for those who have never used them.

GNU Smalltalk supports regular expressions in the core image with methods on String.

The GNU Smalltalk regular expression library is derived from GNU libc, with modifications made originally for Ruby to support Perl-like syntax. It will always use its included library, and never the ones installed on your system; this may change in the future in backwards-compatible ways. Regular expressions are currently 8-bit clean, meaning they can work with any ordinary String, but do not support full Unicode, even when package I18N is loaded.

Broadly speaking, these regexes support Perl 5 syntax: register groups ‘()’ and repetition ‘{}’ are written without backslashes, whereas the corresponding literal characters must be escaped with a backslash. For example, ‘\{{1,3}’ matches ‘{’, ‘{{’, ‘{{{’; correspondingly, ‘(a)(\()’ matches ‘a(’, with ‘a’ and ‘(’ as the first and second register groups respectively. GNU Smalltalk also supports the regex modifiers ‘imsx’, as in Perl. Because trailing modifiers like ‘im’ are not part of Smalltalk syntax, you cannot append them to a Smalltalk string; use the inline modifier syntax instead. For example, ‘(?is:abc.)’ is equivalent to ‘[Aa][Bb][Cc](?:.|\n)’.
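The escaping rules above can be tried directly at the gst prompt. The following is only a small sketch built from the patterns quoted in the text; the subject strings are made up for illustration.

```smalltalk
| result |
"Unescaped braces repeat; an escaped brace is a literal character."
('{{' =~ '\{{1,3}') match printNl.

"Unescaped parentheses group; an escaped parenthesis is a literal character."
result := 'a(' =~ '(a)(\()'.
(result at: 1) printNl.
(result at: 2) printNl.

"Inline modifiers stand in for the trailing i and s flags of Perl."
('ABC!' =~ '(?is:abc.)') matched printNl.
```

If the library behaves as described, this should print '{{', then 'a' and '(', then true.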

In most cases, you should specify regular expressions as ordinary strings. GNU Smalltalk always caches compiled regexes, and uses an especially efficient cache when looking up literal strings (i.e. most regexes), so that most code never needs to see the compiled Regex objects. For special cases where this caching is not good enough, simply send #asRegex to a string to retrieve a compiled form, which works in all places in the public API where you would specify a regex string. You should always rely on the cache until you have demonstrated that using Regex objects makes a noticeable performance difference in your code.
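As a minimal sketch of the special case, a pattern can be compiled once with #asRegex and then passed wherever a regex string would be accepted. The variable name wordRegex and the subject strings are illustrative only.

```smalltalk
| wordRegex |
"Compile once, then reuse the Regex wherever a pattern string is accepted."
wordRegex := '\w+' asRegex.
('hello world' =~ wordRegex) match printNl.
('hello world' replacingRegex: wordRegex with: 'goodbye') printNl.
```

This should behave exactly like the string form, printing 'hello' and 'goodbye world'; whether it is measurably faster depends on your code.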

Smalltalk strings only have one escape, the ‘'’ given by ‘''’, so backslashes used in regular expression strings will be understood as backslashes, and a literal backslash can be given directly with ‘\\’⁸.
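A short illustration of these quoting rules, using throwaway strings chosen for the purpose:

```smalltalk
"The doubled quote is the only string escape, so this string has four characters."
'it''s' size printNl.

"A backslash is an ordinary character in a Smalltalk string, so this one has three."
'a\b' size printNl.

"Two backslashes reach the regex engine unchanged and match one literal backslash."
('a\b' =~ '\\') match size printNl.
```

The three expressions should print 4, 3 and 1 respectively.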

The methods on the compiled Regex object are private to this interface. As a public interface, GNU Smalltalk provides methods on String, in the category ‘regex’. There are several methods for matching, replacing, pattern expansion, iterating over matches, and other useful things.

The fundamental operator is #searchRegex:, usually written as #=~, reminiscent of Perl syntax. This method always returns a RegexResults, which you can query for whether the regex matched, for the location (an Interval) and contents of the match, for any register groups as a collection, and for other features. For example, here is a simple configuration file line parser:

| file config |
"Map each configuration key to its value."
config := LookupTable new.
file := (File name: 'myapp.conf') readStream.
"Register 1 captures the key, register 2 the words that make up the value."
file linesDo: [:line |
    (line =~ '(\w+)\s*=\s*((?: ?\w+)+)') ifMatched: [:match |
        config at: (match at: 1) put: (match at: 2)]].
file close.
config printNl.
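A RegexResults can also be inspected piece by piece. The sketch below assumes the selectors #matched, #match, #matchInterval and #at: as they appear in the 3.x kernel sources; check the class reference for your version before relying on them. The sample line is invented.

```smalltalk
| result |
result := 'timeout = 30' =~ '(\w+)\s*=\s*(\w+)'.
result matched ifTrue: [
    result match printNl.            "whole matched text"
    result matchInterval printNl.    "assumed selector: location in the receiver"
    (result at: 1) printNl.          "first register group"
    (result at: 2) printNl].         "second register group"
```

Testing #matched first keeps the group accessors from being sent to a failed match.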

As with Perl, =~ will scan the entire string and answer the leftmost match if any is to be found, consuming as many characters as possible from that position. You can anchor the search with variant messages like #matchRegex:, or of course ^ and $ with their usual semantics if you prefer.
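The difference between an unanchored and an anchored search can be sketched as follows. The subject strings are illustrative, and #matchRegex: is assumed here to answer a Boolean for a match of the whole receiver, as in the 3.x kernel sources.

```smalltalk
"An unanchored search skips ahead to the leftmost match and takes as much as it can."
('  foo42 bar' =~ '\w+') match printNl.

"Anchoring with ^ makes the same search fail, since the string starts with spaces."
('  foo42 bar' =~ '^\w+') matched printNl.

"Assumption: matchRegex: anchors the pattern at both ends of the receiver."
('foo42' matchRegex: '\w+') printNl.
('foo42 bar' matchRegex: '\w+') printNl.
```

The first two expressions should print 'foo42' and false.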

Do not modify a string while a RegexResults object matched on it is still in use: changes to the matched text may propagate to the RegexResults object and invalidate it.

Analogously to the Perl s operator, GNU Smalltalk provides #replacingRegex:with:. Unlike Perl, GNU Smalltalk employs the pattern expansion syntax of the #% message here. For example,

'The ratio is 16/9.' replacingRegex: '(\d+)/(\d+)' with: '$%1\over%2$'

answers 'The ratio is $16\over9$.'. In place of the g modifier, use the #replacingAllRegex:with: message.
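To make the difference between the two messages concrete, here is a small sketch that swaps numerator and denominator; the subject string is made up for the example.

```smalltalk
"Only the first fraction is rewritten."
('16/9 and 4/3' replacingRegex: '(\d+)/(\d+)' with: '%2/%1') printNl.

"Every fraction is rewritten, like the g modifier in Perl."
('16/9 and 4/3' replacingAllRegex: '(\d+)/(\d+)' with: '%2/%1') printNl.
```

The first expression should answer '9/16 and 4/3', the second '9/16 and 3/4'.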

One other interesting String message is #onOccurrencesOfRegex:do:, which invokes its second argument, a block, on every successful match found in the receiver. Internally, every search will start at the end of the previous successful match. For example, this will print all the words in a stream:

stream contents onOccurrencesOfRegex: '\w+'
                do: [:each | each match printNl]
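The block can just as well accumulate results instead of printing them. A minimal sketch, with an illustrative receiver and a hypothetical variable named words:

```smalltalk
| words |
words := OrderedCollection new.
"Each block argument is a RegexResults; match answers the matched text."
'the quick brown fox' onOccurrencesOfRegex: '\w+'
    do: [:each | words add: each match].
words size printNl.
words first printNl.
words last printNl.
```

Since each search resumes where the previous match ended, the collection should hold the four words in order.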

Footnotes

(8)

Whereas it must be given as ‘\\\\’ in a literal Emacs Lisp string, for example.
