
3.4 Internationalization and localization support

Different countries and cultures have varying conventions for how to communicate. These conventions range from very simple ones, such as the format for representing dates and times, to very complex ones, such as the language spoken. Programs that are written to obey the user's choice of conventions will follow the conventions that the user prefers. GNU Smalltalk provides two packages to help you write such programs. The I18N package covers both internationalization and multilingualization; the lighter-weight Iconv package covers only the latter, as it is a prerequisite for correct internationalization.

Multilingualizing software means programming it to be able to support languages from every part of the world. In particular, it includes understanding multi-byte character sets (such as UTF-8) and Unicode characters whose code point (the equivalent of the ASCII value) is above 127. To this end, GNU Smalltalk provides the UnicodeString class, which stores its data as 32-bit Unicode values. In addition, Character provides support for all of the more than one million code points available in Unicode.

Loading the I18N package improves this support through the EncodedStream class[1], which interprets and transcodes non-ASCII Unicode characters. This support is mostly transparent, because the base classes Character, UnicodeCharacter and UnicodeString are enhanced to use it. Sending asString or printString to an instance of Character or UnicodeString will convert Unicode characters so that they are printed correctly in the current locale. For example, `$<279> printNl' will print a small Latin letter `e' with a dot above when the I18N package is loaded.
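
As a minimal sketch of the behavior just described, the following statements load the package and print the same character in two ways. PackageLoader fileInPackage: is the usual way to load a package in GNU Smalltalk; what actually appears on the terminal depends on the encoding of the current locale.

         "Load I18N so that non-ASCII characters are transcoded when printed."
         PackageLoader fileInPackage: 'I18N'.

         "The literal syntax and #codePoint: both name the character U+0117."
         $<279> printNl.
         (Character codePoint: 279) printNl.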

Conversely, you can convert String or ByteArray objects to Unicode with a single method call. If the current locale's encoding is UTF-8, `#[196 151] asUnicodeString' will return a Unicode string containing the same character as above, the small Latin letter `e' with a dot above.
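
The conversion can be tried out as follows. This is only a sketch, and it assumes that the I18N package is loaded and that the current locale uses UTF-8; under a different encoding the two bytes would be decoded differently.

         | s |
         PackageLoader fileInPackage: 'I18N'.

         "196 151 (hexadecimal C4 97) is the UTF-8 encoding of U+0117."
         s := #[196 151] asUnicodeString.

         "In a UTF-8 locale the two bytes should decode to one character."
         s size printNl.
         (s first = (Character codePoint: 279)) printNl.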

The implementation of multilingualization support is not yet complete. For example, methods such as asLowercase, asUppercase and isLetter do not yet recognize Unicode characters.

You need to exercise some care, or your program will be buggy when Unicode characters are used. In particular, Characters must not be compared with ==[2] and should be printed on a Stream with display: rather than nextPut:.
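
Both precautions can be sketched in a few lines. The example assumes that the I18N package is loaded; the variable names are purely illustrative.

         | c stream |
         PackageLoader fileInPackage: 'I18N'.
         c := Character codePoint: 279.

         "Compare characters by value with =, never by identity with ==."
         (c = $<279>) printNl.

         "Use display: so that the character is converted for the stream,
          instead of storing it directly with nextPut:."
         stream := WriteStream on: String new.
         stream display: c.
         stream contents displayNl.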

Also, Characters need to be created with the class method codePoint: if you are referring to their Unicode value; codePoint: is also the only method to create characters that is accepted by the ANSI Standard for Smalltalk. The method value:, instead, should be used if you are referring to a byte in a particular encoding. This subtle difference means that, for example, the last two of the following examples will fail:

         "Correct.  Use #value: with Strings, #codePoint: with UnicodeString."
         String with: (Character value: 65)
         String with: (Character value: 128)
         UnicodeString with: (Character codePoint: 65)
         UnicodeString with: (Character codePoint: 128)
         "Correct.  Only works for characters in the 0-127 range, which may
          be considered as defensive programming."
         String with: (Character codePoint: 65)
         "Dubious, and only works for characters in the 0-127 range.  With
          UnicodeString, probably you always want #codePoint:."
         UnicodeString with: (Character value: 65)
         "Fails, we try to use a high character in a String"
         String with: (Character codePoint: 128)
         "Fails, we try to use an encoding in a Unicode string"
         UnicodeString with: (Character value: 128)

Internationalizing software, instead, means programming it to be able to adapt to the user's favorite conventions. These conventions can get pretty complex; for example, the user might specify the locale `espana-castellano' for most purposes, but specify the locale `usa-english' for currency formatting: this might make sense if the user is a Spanish-speaking American, working in Spanish, but representing monetary amounts in US dollars. You can see that this system is simple but, at the same time, very complete. This manual, however, is not the right place for a thorough discussion of how a user would set up their system for these conventions; for more information, refer to your operating system's manual or to the GNU C library's manual.

GNU Smalltalk inherits from ISO C the concept of a locale, that is, a collection of conventions, one convention for each purpose. It maps each of these purposes to a Smalltalk class defined by the I18N package; these classes form a small hierarchy with class Locale as its root.

Basic usage of the I18N package involves a single selector, the question mark (?), which is a rarely used yet valid character for a Smalltalk binary message. The meaning of the question mark selector is “How do you say ... under your convention?”. You can send ? either to a specific instance of a subclass of Locale or to the class itself; in the latter case, the rules for the default locale (which is specified via environment variables) apply. You might say, for example, LcTime ? Date today or germanMonetaryLocale ? account balance. This syntax can be confusing at first, but turns out to be convenient because of its consistency and overall simplicity.
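
The two ways of sending ? can be sketched as follows. The class-side sends use the default locale. The instance-creation message fromString: and the locale name 'de_DE' are assumptions made for this illustration: check the library reference for the exact selector, and note that the named locale must be installed on your system. The amount 1500 stands in for the account balance mentioned above.

         | germanMonetaryLocale |
         PackageLoader fileInPackage: 'I18N'.

         "Sent to the class: the default locale's conventions apply."
         (I18N.LcTime ? Date today) displayNl.
         (I18N.LcNumber ? 1234567) displayNl.

         "Sent to an instance: the conventions of one specific locale apply.
          fromString: and 'de_DE' are assumptions, see the text above."
         germanMonetaryLocale := I18N.LcMonetary fromString: 'de_DE'.
         (germanMonetaryLocale ? 1500) displayNl.

The output is deliberately not shown, because it depends on the locales configured on the machine where the code runs.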

Here is how ? works for different classes:

— Method on LcTime: ? aString

Format a date, a time or a timestamp (DateTime object).

— Method on LcNumber: ? aString

Format a number.

— Method on LcMonetary: ? aString

Format a monetary value together with its currency symbol.

— Method on LcMonetaryISO: ? aString

Format a monetary value together with its iso currency symbol.

— Method on LcMessages: ? aString

Answer an LcMessagesDomain that retrieves translations from the specified file.

— Method on LcMessagesDomain: ? aString

Retrieve the translation of the given string.[3]
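
Translations therefore take two sends of ?: one to obtain the domain and one to look up each string. In the sketch below, 'myapp' is a hypothetical name for the file holding the translations, and 'Hello, world' is a hypothetical message; neither comes from a real catalog.

         | domain |
         PackageLoader fileInPackage: 'I18N'.

         "First ? : answer an LcMessagesDomain for the translation file.
          'myapp' is a made-up name used only for illustration."
         domain := I18N.LcMessages ? 'myapp'.

         "Second ? : ask the domain for the translation of a string."
         (domain ? 'Hello, world') displayNl.

Keeping the domain in a variable avoids repeating the first lookup for every string that the program translates.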

These two packages provide much more functionality, including more advanced formatting options, support for Unicode, and conversion to and from several character sets. For more information, refer to Multilingual and international support with Iconv and I18N.

As an aside, the representation of locales that the package uses is exactly the same as that of the C library, which has many advantages: the burden of maintaining locale data is removed from GNU Smalltalk's maintainers; the need to have two copies of the same data is removed from GNU Smalltalk's users; and finally, uniformity of the conventions assumed by different internationalized programs is guaranteed to the end user.

In addition, the representation of translated strings is the standard MO file format adopted by the GNU gettext library.


[1] All the classes mentioned in this section reside in the I18N namespace.

[2] Character equality with = will be as fast as with ==.

[3] The ? method does not apply to the LcMessagesDomain class itself, but only to its instances. This is because LcMessagesDomain is not a subclass of Locale.