Projects relating to cpplib

Note: this writeup represents state as of 2002.

cpplib has largely been completed and is stable at this point. For GCC versions 3.0 and later, it is linked into the C, C++, and Objective C front ends. Most future work will relate to character set issues, performance enhancements, and improving cpplib as a stand-alone library.

Work recently completed

  1. Stand-alone CPP is dead. The compiler front end now handles preprocessed output if necessary.
  2. As many built-in macros as possible have been moved to the front ends, and out of SPECS and cpplib itself (some targets still in progress).
  3. CPP arithmetic is now done to the correct target precision, based on the selected language standard (a small probe illustrating this follows the list).
  4. The traditional preprocessor has been integrated into cpplib. At present it is an output-only preprocessor, but it should be fairly simple to modify cpplib so that traditional preprocessing and then tokenization are performed in one invocation.
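
To make item 3 concrete, here is a small probe (my own illustration, not part of cpplib or its test suite). Under C89 rules, #if arithmetic is carried out in the target's long/unsigned long; under C99 rules, in the target's intmax_t/uintmax_t. On a typical 32-bit target the probe therefore selects differently under -std=c89 and -std=c99:

    /* At 32-bit precision, 0xFFFFFFFF does not fit in a signed type, so
       it is unsigned; -1 converts to it and the comparison is true.  At
       64-bit precision, 0xFFFFFFFF fits in a signed intmax_t, so -1
       compares unequal to it. */
    #if -1 == 0xFFFFFFFF
    #define PP_ARITH_WIDER_THAN_32 0
    #else
    #define PP_ARITH_WIDER_THAN_32 1
    #endif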

Greater coordination with the front ends

The integrated preprocessor would benefit from greater integration with the front ends. It still feels like it has been tacked on as an afterthought, which is not entirely a coincidence.

  1. Character sets that are strict supersets of ASCII are safe to use, but extended characters cannot appear in identifiers. This has to be coordinated with the C and C++ front ends. See character set issues, below.
  2. C99 universal character escapes (\uxxxx, \Uxxxxxxxx) are not recognized in identifiers. Proper support has to be coordinated with the front ends.
  3. Precompiled headers are commonly requested; this entails the ability for cpp to dump out and reload all of its internal state. You can get some of this with the debug switches, but not all, and not in a reloadable format. The front end must also cooperate.
  4. Integration of diagnostic reporting. The front ends could use extra information only available to the preprocessor, such as column numbers and macros under expansion. The existing code copies cpplib's internal state into the state used by diagnostic.c, which is better than writing out and processing linemarker commands, but still suboptimal.
  5. If YACC did not insist on assigning its own values for token codes, there would be no need for a translation layer between the codes returned by cpplib and the codes used by the parser. Noises have been made about a recursive-descent parser that could handle all of C, C++, and Objective C; if this ever happens, it should use cpplib's token codes.
  6. String concatenation should be handled in the function c_lex in c-lex.c. Then the front ends would not have to jump through hoops to remember to concatenate strings, and we could simplify the parsers a little too (see the example after this list).
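
For reference, the concatenation in item 6 is the merging of adjacent string literals (translation phase 6). For example:

    /* Adjacent string literals denote a single string.  If c_lex merged
       them, every parser would see one token for "Hello, world!\n"
       instead of three separate string tokens to stitch together. */
    const char *msg = "Hello, "
                      "world"
                      "!\n";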

Potential minor improvements

  1. The file-handling code still allocates many items with xmalloc; the rest of cpplib is now reasonably efficient in its use of memory. Minor improvements are certainly still possible.
  2. There might be room to further improve macro expansion performance, although it is now pretty good. For example, we currently pre-expand each argument (if necessary) into its own buffer, replace the arguments in the replacement list with their expansions, and then free each buffer. It might be better to expand the arguments directly into the final argument-replaced expansion, saving one copy per argument and the need to free the argument expansion buffers. The disadvantage is that we would not know in advance how large to make the token buffer [equally, though, we do not know in advance how large to make each expanded argument buffer either]. In view of this, a further enhancement might be to permit the list of token pointers that represents the expansion to be made up of more than one run; then we would just append a new run, rather than reallocating the expansion buffer when we overflow its initial bounds.
  3. It might be worth trying to optimize wrapper headers - files containing only an #include of another file - so that they are optimized out on reinclusion (see the sketch after this list). This is trickier than it sounds: something with heuristics similar to the multiple-include optimization is needed, one that handles multiple levels of wrapper headers.
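
To illustrate item 3, a wrapper header looks like this (the file names are invented for the example):

    /* config.h - a wrapper header; its entire content is one #include */
    #include "target-config.h"

    /* target-config.h - guarded in the usual way, so the existing
       multiple-include optimization already avoids re-reading this
       file's contents on reinclusion */
    #ifndef TARGET_CONFIG_H
    #define TARGET_CONFIG_H
    /* ... declarations ... */
    #endif

The goal would be for cpplib to avoid even reopening config.h the second time it is included, which means recognizing that config.h's effective guard is TARGET_CONFIG_H - and doing so through several levels of wrapping.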

Character set issues

Proper non-ASCII character handling is a hard problem. Users want to be able to write comments and strings in their native language, and they want those strings to come out in their native language, not gibberish, after translation to object code. Some users also want to use their own alphabet for identifiers in their code. There is no one-to-one or many-to-one map between languages and character set encodings. The subset of ASCII that is included in most modern-day character sets does not include all the punctuation C uses; some of the missing punctuation may be present, but at a different place than where it is in ASCII. The subset described in ISO646 may not be the smallest such subset out there.
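
The missing-punctuation problem is old enough that the C standard itself addresses it: trigraphs give ISO646-invariant spellings for the punctuation characters that national variants of ISO646 replace. A contrived example (note that GCC interprets trigraphs only in strict-conformance modes or with -trigraphs):

    ??=include <stdio.h>        /* ??= is the trigraph for #      */

    int main(void)
    ??<                         /* ??< and ??> stand for { and }  */
        char s??(2??);          /* ??( and ??) stand for [ and ]  */
        s??(0??) = 'x';
        s??(1??) = '\0';
        puts(s);
        return 0;
    ??>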

At the present time, GCC supports the use of any encoding for source code, as long as it is a strict superset of 7-bit ASCII. By this I mean that every printable ASCII character and whitespace character, when it appears as a single byte in a file, stands only for itself, no matter what the context is. This is true of ISO8859.x, KOI8-R, and UTF8. It is not true of Shift JIS and some other popular Asian character sets; if they are used, GCC may silently mangle the input file. The only known specific example: a Shift JIS multibyte character whose second byte is 0x5C ("\" in ASCII) will be mistaken for a line continuation if it occurs at the end of a line.
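
Schematically, the Shift JIS hazard looks like this (the katakana character ソ is the byte pair 0x83 0x5C in Shift JIS; it is shown symbolically here rather than in its actual encoding):

    int x = 1;   // comment ending in Shift JIS ソ (bytes 0x83 0x5C)
    int y = 2;   // the trailing 0x5C byte looks like "\" at end of line,
                 // so the line above appears to end in a line
                 // continuation; this declaration is spliced into the
                 // comment and y silently never gets declared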

Assuming a safe encoding, characters not in the basic character set listed in the standard (C99 5.2.1) are syntax errors if they appear outside strings, character constants, or comments. In strings and character constants they are taken literally: converted blindly to numeric codes, or copied to the assembly output verbatim, depending on the context. If you use the C99 \u and \U escapes, you get UTF8, no exceptions; these too are supported only in string and character constants.
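
For example, under the current rules a \u escape in a string always yields UTF8 bytes, regardless of locale or execution character set:

    #include <stdio.h>

    int main(void)
    {
        /* \u20AC is U+20AC, the euro sign; as described above it is
           blindly converted to its UTF8 encoding. */
        const char euro[] = "\u20AC";
        const unsigned char *p = (const unsigned char *) euro;
        while (*p)
            printf("0x%02X ", *p++);
        printf("\n");   /* prints: 0xE2 0x82 0xAC */
        return 0;
    }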

We intend to improve this as follows:

  1. cpplib will be reworked so that it can handle any character set in wide use, whether or not it is a strict superset of 7-bit ASCII. This means that cpplib will never confuse non-ASCII characters with C punctuators, comment delimiters, or other syntax.
  2. Any character will, naturally, be permitted to appear in comments.
  3. All Unicode code points that C99 Annex D permits in identifiers will be accepted in identifiers. All source-file characters which, when translated to Unicode, correspond to permitted code points will also be accepted. In assembly output, identifiers will be encoded in UTF8, then reencoded with some mangling scheme if the assembler cannot handle UTF8 identifiers. (Does the new C++ ABI have anything to say about this? What does the Java compiler do?)
    Unicode U+0024 will be permitted in identifiers if and only if $ is permitted.
  4. In strings and character constants, GCC will translate from the character set of the file (selectable on a per-file basis) to the current execution character set (chosen once per compilation), which may or may not be Unicode. UCN escapes will also be converted from Unicode to the execution character set; this happens independently of the source character set.
  5. Each file referenced by the compiler may state its own character set with a #pragma, or rely on the default established by the user with the locale or a command-line option. The #pragma, if used, must be the first line in the file; this will not prevent the multiple-include optimization from working. GCC will also recognize MULE (Multilingual Emacs) magic comments, byte order marks, and any other reasonable in-band method of specifying a file's character set. (A deliberately hypothetical sketch of such a pragma follows this list.)
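
No pragma syntax has actually been settled; purely as an illustration of item 5, a file might one day begin like this (the pragma name and spelling here are hypothetical, not an existing GCC feature):

    #pragma GCC file_encoding "ISO8859-2"   /* hypothetical; must be the
                                               first line of the file */
    /* ... the rest of the file, in the declared character set ... */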

It's worth noting that the standard C library facilities for "multibyte character sets" are not adequate to implement the above. The basic problem is that neither C89 nor C99 gives you any way to specify the character set of a file directly. You can manipulate the "locale," which indirectly specifies the character set, but that is a global change. Further, locale names are not defined by the C standard, nor is there any consistent map between them and character sets.

The Single Unix Specification, and possibly also POSIX, provides the nl_langinfo and iconv interfaces, which mostly circumvent these limitations. We may require these interfaces to be present for complete non-ASCII support to be functional.
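
A minimal sketch of how the two interfaces fit together (my code, not GCC's; error handling is abbreviated, and on some hosts iconv lives in a separate library such as -liconv):

    #include <stdio.h>
    #include <string.h>
    #include <locale.h>
    #include <langinfo.h>
    #include <iconv.h>

    int main(void)
    {
        /* nl_langinfo(CODESET) names the locale's character set - the
           piece of information plain C89/C99 never exposes directly. */
        setlocale(LC_CTYPE, "");
        const char *codeset = nl_langinfo(CODESET);
        printf("locale character set: %s\n", codeset);

        /* iconv converts between two *named* encodings, so a compiler
           can choose a source character set per file without touching
           the global locale. */
        iconv_t cd = iconv_open("UTF-8", codeset);
        if (cd == (iconv_t) -1) {
            perror("iconv_open");
            return 1;
        }

        char in[] = "example text";
        char out[64];
        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof out;
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
            perror("iconv");
        else
            printf("converted to %zu UTF-8 bytes\n", sizeof out - outleft);

        iconv_close(cd);
        return 0;
    }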

One final note: EBCDIC is, and will be, supported as a source character set if and only if GCC is compiled for a host (not a target) which uses EBCDIC natively.