Unicode Support (Units: A Unit Conversion Program and Scientific Calculator)

18 Unicode Support

The standard units data file is in Unicode, using UTF-8 encoding. Most definitions use only ASCII characters (i.e., code points U+0000 through U+007F); definitions using non-ASCII characters appear in blocks beginning with ‘!utf8’ and ending with ‘!endutf8’.

The non-ASCII definitions are loaded only if the platform and the locale support UTF-8. Platform support is determined when units is compiled; the locale is checked at every invocation of units. To see if your version of units includes Unicode support, invoke the program with the --version option.
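The per-invocation locale check can be pictured with a short Python sketch. This is an illustration only, not how units itself is implemented: the function names are hypothetical, and the sketch assumes a POSIX system where `locale.nl_langinfo` is available.

```python
import locale


def is_utf8_codeset(codeset: str) -> bool:
    """Return True for spellings such as 'UTF-8', 'utf8', or 'UTF8'."""
    return codeset.replace("-", "").replace("_", "").lower() == "utf8"


def locale_supports_utf8() -> bool:
    """Report whether the locale chosen by the environment uses UTF-8.

    Hypothetical helper; POSIX only, because nl_langinfo is not
    provided on Windows.
    """
    try:
        # Adopt the LANG / LC_* settings of the calling environment.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # Unknown locale name: treat it as lacking UTF-8 support.
        return False
    return is_utf8_codeset(locale.nl_langinfo(locale.CODESET))
```

Calling `locale_supports_utf8()` under a locale such as `en_US.UTF-8` would be expected to return a true value, while the plain `C` locale typically reports an ASCII code set.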

When Unicode support is available, units checks every line within UTF-8 blocks in all of the units data files for invalid or non-printing UTF-8 sequences; if such sequences occur, units ignores the entire line. In addition to checking validity, units determines the display width of non-ASCII characters to ensure proper positioning of the pointer in some error messages and to align columns for the ‘search’ and ‘?’ commands.
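The two checks described above, validity and display width, can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions rather than the logic units actually uses: the function names are hypothetical, "non-printing" is approximated with `str.isprintable`, and width is estimated from the Unicode East Asian Width property.

```python
import unicodedata


def is_loadable(raw_line: bytes) -> bool:
    """Return True if the line is valid UTF-8 with no non-printing characters."""
    try:
        text = raw_line.decode("utf-8")
    except UnicodeDecodeError:
        return False  # invalid UTF-8 sequence: the whole line is rejected
    # Tabs and newlines are ordinary whitespace in a data file, so allow them.
    return all(ch.isprintable() or ch in "\t\n" for ch in text)


def display_width(text: str) -> int:
    """Estimate how many terminal columns the text occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks occupy no column of their own
        # Wide and fullwidth characters take two columns; others take one.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width
```

A width estimate of this kind is what lets a pointer under an error message, or a column in a listing, line up when a unit name contains characters that are wider or narrower than one column.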

As of early 2019, Microsoft Windows provides limited support for UTF-8 in console applications, and accordingly, units does not support Unicode on Windows. The UTF-16 and UTF-32 encodings are not supported on any platform.


If Unicode support is available and definitions that contain non-ASCII UTF-8 characters are added to a units data file, those definitions should be enclosed within ‘!utf8’ … ‘!endutf8’ to ensure that they are only loaded when Unicode support is available. As usual, the ‘!’ must appear as the first character on the line. As discussed in Units Data Files, it’s usually best to put such definitions in supplemental data files linked by an ‘!include’ command or in a personal units data file.
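The effect of the ‘!utf8’ … ‘!endutf8’ markers can be sketched as a simple line filter in Python. The function name and the sample definitions are hypothetical and serve only to illustrate the behavior described here; the real data file parser handles many more commands.

```python
def select_definitions(lines, unicode_supported):
    """Return the lines that would be processed as definitions.

    Lines inside a !utf8 ... !endutf8 block are kept only when
    unicode_supported is true. The '!' must be the first character
    on the line, so only trailing whitespace is stripped.
    """
    kept = []
    in_utf8_block = False
    for line in lines:
        command = line.rstrip()
        if command == "!utf8":
            in_utf8_block = True
        elif command == "!endutf8":
            in_utf8_block = False
        elif unicode_supported or not in_utf8_block:
            kept.append(line)
    return kept


# Illustrative data file contents (made up for this example).
SAMPLE = [
    "smoot      67 inch",
    "!utf8",
    "ångström   angstrom",
    "!endutf8",
    "halfsmoot  0.5 smoot",
]
```

With `unicode_supported` set to `False`, the sketch drops the ‘ångström’ line and keeps the two ASCII definitions; with `True`, all three definitions are kept.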

When Unicode support is not available, units makes no assumptions about character encoding, except that bytes in the range 00–7F hexadecimal correspond to ASCII characters. Non-ASCII characters are simply sequences of bytes and have no special meaning; for definitions in supplemental units data files, you can use any encoding consistent with this assumption. For example, if you wish to use non-ASCII characters in definitions when running units under Windows, you can use a character set such as Windows “ANSI” (code page 1252 in the US and Western Europe); if you do, the console code page must be set to the same encoding for the characters to display properly. You can even use UTF-8, though some messages may be improperly aligned, and units will not detect invalid UTF-8 sequences. If you use UTF-8 encoding when Unicode support is not available, you should place any definitions with non-ASCII characters outside ‘!utf8’ … ‘!endutf8’ blocks; otherwise, they will be ignored.
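To see why the data file and the console must agree on an encoding, it may help to look at the bytes involved. The following Python sketch encodes one made-up definition line in both code page 1252 and UTF-8; the definition text and variable names are illustrative only.

```python
# A made-up definition line containing one non-ASCII character (U+00B5).
definition = "µ- micro"

ansi_bytes = definition.encode("cp1252")  # Windows "ANSI", code page 1252
utf8_bytes = definition.encode("utf-8")

# The ASCII part is byte-for-byte identical in both encodings;
# only the non-ASCII character differs (one byte versus two).
ascii_tail_matches = ansi_bytes[-7:] == utf8_bytes[-7:]

# Reading UTF-8 bytes as code page 1252 shows the kind of garbling
# that appears when the console code page does not match the file.
misread = utf8_bytes.decode("cp1252")

print(ansi_bytes)
print(utf8_bytes)
print(misread)
```

Because units treats the non-ASCII bytes as opaque when Unicode support is unavailable, either encoding works as a unit name; the mismatch matters only when the characters are typed or displayed.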

Except for code examples, typeset material usually uses the Unicode symbols for mathematical operators. To facilitate copying and pasting from such sources, several typographical characters are converted to the ASCII operators used in units: the figure dash (U+2012), minus sign (‘−’; U+2212), and en dash (‘–’; U+2013) are converted to the operator ‘-’; the multiplication sign (‘×’; U+00D7), N-ary times operator (U+2A09), dot operator (‘⋅’; U+22C5), and middle dot (‘·’; U+00B7) are converted to the operator ‘*’; the division sign (‘÷’; U+00F7) is converted to the operator ‘/’; and the fraction slash (U+2044) is converted to the operator ‘|’.
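The conversions listed above amount to a character-for-character translation table. The following Python sketch reproduces that table with `str.translate`; the names `OPERATOR_MAP` and `to_ascii_operators` are hypothetical, and the sketch is not taken from the units source.

```python
# Typographical characters and the ASCII operators they are converted to,
# as listed in the paragraph above.
OPERATOR_MAP = str.maketrans({
    "\u2012": "-",  # figure dash
    "\u2212": "-",  # minus sign
    "\u2013": "-",  # en dash
    "\u00d7": "*",  # multiplication sign
    "\u2a09": "*",  # N-ary times operator
    "\u22c5": "*",  # dot operator
    "\u00b7": "*",  # middle dot
    "\u00f7": "/",  # division sign
    "\u2044": "|",  # fraction slash
})


def to_ascii_operators(expression: str) -> str:
    """Replace typographical operators with their ASCII equivalents."""
    return expression.translate(OPERATOR_MAP)
```

For instance, an expression pasted as `3 × 4 − 1` would come out as `3 * 4 - 1`, and a fraction slash between two numbers would become the ‘|’ operator.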