Input processing (GNU M4 1.4.19 macro processor)

As m4 reads the input token by token, it copies each token to the output immediately.

The exception is when it finds a word with a macro definition. In that case m4 will calculate the macro’s expansion, possibly reading more input to get the arguments. It then inserts the expansion in front of the remaining input. In other words, the resulting text from a macro call will be read and parsed into tokens again.

m4 expands a macro as soon as possible. If it finds a macro call when collecting the arguments to another, it will expand the second call first. This process continues until there are no more macro calls to expand and all the input has been consumed.

For a running example, examine how m4 handles this input:

format(`Result is %d', eval(`2**15'))

First, m4 sees that the token ‘format’ is a macro name, so it collects the tokens ‘(’, ‘`Result is %d'’, ‘,’, and ‘ ’, before encountering another potential macro. Sure enough, ‘eval’ is a macro name, so the nested argument collection picks up ‘(’, ‘`2**15'’, and ‘)’, invoking the eval macro with the lone argument of ‘2**15’. The expansion of ‘eval(2**15)’ is ‘32768’, which is then rescanned as the five tokens ‘3’, ‘2’, ‘7’, ‘6’, and ‘8’; and combined with the next ‘)’, the format macro now has all its arguments, as if the user had typed:

format(`Result is %d', 32768)

The format macro expands to ‘Result is 32768’, and we have another round of scanning for the tokens ‘Result’, ‘ ’, ‘is’, ‘ ’, ‘3’, ‘2’, ‘7’, ‘6’, and ‘8’. None of these are macros, so the final output is

⇒Result is 32768
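
The read, expand, and rescan cycle traced above can be modelled in a few lines of Python. This is only a toy sketch under stated assumptions, not how m4 is implemented: the names m4_toy, toy_eval, and toy_format are hypothetical, quotes do not nest, only ‘**’ and ‘%d’ are supported, and unquoted parentheses inside arguments are not handled.

```python
import re

WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def toy_eval(args):
    # Hypothetical stand-in for eval: handles only "base**exponent".
    base, exponent = args[0].split("**")
    return str(int(base) ** int(exponent))


def toy_format(args):
    # Hypothetical stand-in for format: handles a single %d only.
    return args[0].replace("%d", args[1], 1)


MACROS = {"eval": toy_eval, "format": toy_format}


def next_token(text):
    """Split one token off the front: a quoted string, a word, or one character."""
    if text[0] == "`":
        end = text.index("'")  # assumption: quotes do not nest in this toy
        return ("string", text[1:end]), text[end + 1:]
    match = WORD.match(text)
    if match:
        return ("word", match.group()), text[match.end():]
    return ("char", text[0]), text[1:]


def collect_args(text):
    """Collect arguments up to ')', expanding nested macro calls on the way."""
    args, current, at_start = [], "", True
    while True:
        (kind, value), text = next_token(text)
        if kind == "char" and value.isspace() and at_start:
            continue  # leading unquoted whitespace of an argument is discarded
        at_start = False
        if kind == "word" and value in MACROS and text.startswith("("):
            inner, text = collect_args(text[1:])
            text = MACROS[value](inner) + text  # expansion goes back in front
        elif kind == "char" and value == ",":
            args.append(current)
            current, at_start = "", True
        elif kind == "char" and value == ")":
            args.append(current)
            return args, text
        else:
            current += value  # string tokens arrive with their quotes stripped


def m4_toy(text):
    """Copy tokens to the output, rescanning every macro expansion."""
    out = ""
    while text:
        (kind, value), text = next_token(text)
        if kind == "word" and value in MACROS and text.startswith("("):
            args, text = collect_args(text[1:])
            text = MACROS[value](args) + text  # rescan the expansion
        else:
            out += value
    return out


print(m4_toy("format(`Result is %d', eval(`2**15'))"))  # Result is 32768
```

The key line in both loops is the one that prepends the expansion to the remaining input, which is what lets the digits of ‘32768’ be picked up as part of the second argument to format.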

As a more complicated example, we will contrast an actual code example from the Gnulib project¹, showing both a buggy approach and the desired results. The user desires to output a shell assignment statement that takes its argument and turns it into a shell variable by converting it to uppercase and prepending a prefix. The original attempt looks like this:

changequote([,])dnl
define([gl_STRING_MODULE_INDICATOR],
  [
    dnl comment
    GNULIB_]translit([$1],[a-z],[A-Z])[=1
  ])dnl
  gl_STRING_MODULE_INDICATOR([strcase])
⇒  
⇒        GNULIB_strcase=1
⇒

Oops – the argument did not get capitalized. And although the manual is not able to easily show it, both lines that appear empty actually contain two trailing spaces. By stepping through the parse, it is easy to see what happened. First, m4 sees the token ‘changequote’, which it recognizes as a macro, followed by ‘(’, ‘[’, ‘,’, ‘]’, and ‘)’ to form the argument list. The macro expands to the empty string, but changes the quoting characters to something more useful for generating shell code (unbalanced ‘`’ and ‘'’ appear all the time in shell scripts, but unbalanced ‘[]’ tend to be rare). Also in the first line, m4 sees the token ‘dnl’, which it recognizes as a builtin macro that consumes the rest of the line, resulting in no output for that line.

The second line starts a macro definition. m4 sees the token ‘define’, which it recognizes as a macro, followed by a ‘(’, ‘[gl_STRING_MODULE_INDICATOR]’, and ‘,’. Because an unquoted comma was encountered, the first argument is known to be the expansion of the single-quoted string token, or ‘gl_STRING_MODULE_INDICATOR’. Next, m4 sees ‘NL’, ‘ ’, and ‘ ’, but this whitespace is discarded as part of argument collection. Then comes a rather lengthy single-quoted string token, ‘[NL    dnl commentNL    GNULIB_]’. This is followed by the token ‘translit’, which m4 recognizes as a macro name, so a nested macro expansion has started.

The arguments to translit are found by the tokens ‘(’, ‘[$1]’, ‘,’, ‘[a-z]’, ‘,’, ‘[A-Z]’, and finally ‘)’. All three string arguments are expanded (or in other words, the quotes are stripped), and since neither ‘$’ nor ‘1’ needs capitalization, the result of the macro is ‘$1’. This expansion is rescanned, resulting in the two literal characters ‘$’ and ‘1’.
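
Why the early translit call is a no-op can be checked with a small Python stand-in. This is a sketch under assumptions: the function below is a hypothetical model that takes fully spelled-out character lists of equal length, whereas m4's translit also accepts ranges such as ‘a-z’.

```python
import string


def translit(text, source, target):
    """Toy model: map each character in source to the one at the same position in target."""
    return text.translate(str.maketrans(source, target))


lower, upper = string.ascii_lowercase, string.ascii_uppercase

# At definition time the first argument is still the placeholder text.
print(translit("$1", lower, upper))       # $1
# Only once the real argument is in place is there anything to convert.
print(translit("strcase", lower, upper))  # STRCASE
```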

Scanning of the outer macro resumes, and picks up with ‘[=1NL  ]’, and finally ‘)’. The collected pieces of expanded text are concatenated, with the end result that the macro ‘gl_STRING_MODULE_INDICATOR’ is now defined to be the sequence ‘NL    dnl commentNL    GNULIB_$1=1NL  ’. Once again, ‘dnl’ is recognized and avoids a newline in the output.

The final line is then parsed, beginning with ‘ ’ and ‘ ’ that are output literally. Then ‘gl_STRING_MODULE_INDICATOR’ is recognized as a macro name, with an argument list of ‘(’, ‘[strcase]’, and ‘)’. Since the definition of the macro contains the sequence ‘$1’, that sequence is replaced with the argument ‘strcase’ prior to starting the rescan. The rescan sees ‘NL’ and four spaces, which are output literally, then ‘dnl’, which discards the text ‘ commentNL’. Next come four more spaces, also output literally, and the token ‘GNULIB_strcase’, which resulted from the earlier parameter substitution. Since that is not a macro name, it is output literally, followed by the literal tokens ‘=’, ‘1’, ‘NL’, and two more spaces. Finally, the original ‘NL’ seen after the macro invocation is scanned and output literally.
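
The whitespace accounting in this walkthrough can be replayed with plain string operations in Python. This is a rough sketch, not an m4 implementation: the helper rescan is hypothetical, it substitutes only ‘$1’, and it approximates ‘dnl’ with a regular expression instead of real tokenization.

```python
import re

# Body stored by the first (buggy) definition; "\n" stands for NL in the text.
body = "\n    dnl comment\n    GNULIB_$1=1\n  "


def rescan(body, arg):
    """Substitute $1, then drop each dnl together with the rest of its line."""
    text = body.replace("$1", arg)
    return re.sub(r"\bdnl\b[^\n]*\n?", "", text)


# Two literal spaces precede the call and a newline follows it.
output = "  " + rescan(body, "strcase") + "\n"
print(output.split("\n"))
# ['  ', '        GNULIB_strcase=1', '  ', '']
```

The printed list makes the two trailing spaces on the otherwise empty lines visible, along with the eight spaces in front of ‘GNULIB_strcase=1’.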

Now for a corrected approach. This rearranges the use of newlines and whitespace so that less whitespace is output (which, although harmless to shell scripts, can be visually unappealing), and fixes the quoting issues so that the capitalization occurs when the macro ‘gl_STRING_MODULE_INDICATOR’ is invoked, rather than when it is defined. It also adds another layer of quoting to the first argument of translit, to ensure that the output will be rescanned as a string rather than a potential uppercase macro name needing further expansion.
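
The difference between converting at definition time and at invocation time can be illustrated by analogy in Python. This is only an analogy with hypothetical names (define_early, define_deferred); it models when the conversion runs, not m4's quoting rules.

```python
def define_early():
    # The conversion runs while the definition is being collected, so it only
    # ever sees the placeholder "$1", which has no lowercase letters to change.
    body = "GNULIB_" + "$1".upper() + "=1"
    return lambda arg: body.replace("$1", arg)


def define_deferred():
    # The conversion is kept as part of the stored body (like the quoted
    # translit call), so it runs at invocation time on the real argument.
    return lambda arg: "GNULIB_" + arg.upper() + "=1"


print(define_early()("strcase"))     # GNULIB_strcase=1
print(define_deferred()("strcase"))  # GNULIB_STRCASE=1
```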

changequote([,])dnl
define([gl_STRING_MODULE_INDICATOR],
  [dnl comment
  GNULIB_[]translit([[$1]], [a-z], [A-Z])=1dnl
])dnl
  gl_STRING_MODULE_INDICATOR([strcase])
⇒    GNULIB_STRCASE=1

The parsing of the first line is unchanged. The second line sees the name of the macro to define, then sees the discarded ‘NL’ and two spaces, as before. But this time, the next token is ‘[dnl commentNL  GNULIB_[]translit([[$1]], [a-z], [A-Z])=1dnlNL]’, which includes nested quotes, followed by ‘)’ to end the macro definition and ‘dnl’ to skip the newline. No early expansion of translit occurs, so the entire string becomes the definition of the macro.

The final line is then parsed, beginning with two spaces that are output literally, and an invocation of gl_STRING_MODULE_INDICATOR with the argument ‘strcase’. Again, the ‘$1’ in the macro definition is substituted prior to rescanning. Rescanning first encounters ‘dnl’, and discards ‘ commentNL’. Then two spaces are output literally. Next comes the token ‘GNULIB_’, but that is not a macro, so it is output literally. The token ‘[]’ is an empty string, so it does not affect output. Then the token ‘translit’ is encountered.

This time, the arguments to translit are parsed as ‘(’, ‘[[strcase]]’, ‘,’, ‘ ’, ‘[a-z]’, ‘,’, ‘ ’, ‘[A-Z]’, and ‘)’. The two spaces are discarded, and translit produces the desired result ‘[STRCASE]’. This is rescanned, but since it is a string, the quotes are stripped and the only output is a literal ‘STRCASE’. Then the scanner sees ‘=’ and ‘1’, which are output literally, followed by ‘dnl’ which discards the rest of the definition of gl_STRING_MODULE_INDICATOR. The newline at the end of output is the literal ‘NL’ that appeared after the invocation of the macro.

The order in which m4 expands the macros can be further explored using the trace facilities of GNU m4 (see Trace).

3.5 How `m4` copies input to output

Footnotes

(1)
