GNU gettext utilities

This manual documents the GNU gettext tools and the GNU libintl library, version 0.17.

--- The Detailed Node Listing ---

Introduction

The User's View

Setting the POSIX Locale

Preparing Program Sources

Making the PO Template File

Creating a New PO File

Updating Existing PO Files

Editing PO Files

Emacs's PO File Editor

Using Translation Compendia

Manipulating PO Files

Highlighting parts of PO files

Producing Binary MO Files

The Programmer's View

About catgets

About gettext

Temporary Notes for the Programmers Chapter

The Translator's View

Organization

National Teams

The Maintainer's View

Files You Must Create or Alter

Autoconf macros for use in configure.ac

Integrating with CVS

Other Programming Languages

The Translator's View

Individual Programming Languages

sh - Shell Script

Perl

Internationalizable Data

Concluding Remarks

Language Codes

Licenses


1 Introduction

This chapter explains the goals sought in the creation of GNU gettext and the free Translation Project. Then, it explains a few broad concepts around Native Language Support, and positions message translation with regard to other aspects of national and cultural variance, as they apply to programs. It also surveys those files used to convey the translations. It explains how the various tools interact in the initial generation of these files, and later, how the maintenance cycle should usually operate.

In this manual, we use he when speaking of the programmer or maintainer, she when speaking of the translator, and they when speaking of the installers or end users of the translated program. This is only a convenience for clarifying the documentation. It is absolutely not meant to imply that some roles are more appropriate to males or females. Besides, as you might guess, GNU gettext is meant to be useful for people using computers, whatever their sex, race, religion or nationality!

Please send suggestions and corrections to:

     Internet address:
         bug-gnu-gettext@gnu.org

Please include the manual's edition number and update date in your messages.


1.1 The Purpose of GNU gettext

Usually, programs are written and documented in English, and use English at execution time to interact with users. This is true not only of GNU software, but also of a great deal of proprietary and free software. Using a common language is quite handy for communication between developers, maintainers and users from all countries. On the other hand, most people are less comfortable with English than with their own native language, and would prefer to use their mother tongue for day-to-day work, as far as possible. Many would simply love to see their computer screen showing a lot less English, and far more of their own language.

However, to many people, this dream might appear so far-fetched that they may believe it is not even worth spending time thinking about it. They have no confidence at all that the dream might ever come true. Yet some have not lost hope, and have organized themselves. The Translation Project is a formalization of this hope into a workable structure, which has a good chance of getting all of us nearer the achievement of a truly multi-lingual set of programs.

GNU gettext is an important step for the Translation Project, as it is an asset on which we may build many other steps. This package offers programmers, translators, and even users a well-integrated set of tools and documentation. Specifically, the GNU gettext utilities are a set of tools that provides a framework within which other free packages may produce multi-lingual messages.

GNU gettext is designed to minimize the impact of internationalization on program sources, keeping this impact as small and hardly noticeable as possible. Internationalization has better chances of succeeding if it is very lightweight, or at least appears to be so, when looking at program sources.

The Translation Project also uses the GNU gettext distribution as a vehicle for documenting its structure and methods. This goes beyond the strict technicalities of documenting GNU gettext proper. By so doing, translators will find in a single place, as far as possible, all they need to know for properly doing their translating work. This supplemental documentation might also help programmers, and even curious users, in understanding how GNU gettext is related to the remainder of the Translation Project, and consequently, have a glimpse at the big picture.


1.2 I18n, L10n, and Such

Two long words appear all the time when we discuss support of native language in programs, and these words have a precise meaning, worth being explained here, once and for all in this document. The words are internationalization and localization. Many people, tired of writing these long words over and over again, took up the habit of writing i18n and l10n instead, quoting the first and last letter of each word, and replacing the run of intermediate letters by a number merely telling how many such letters there are. But in this manual, for the sake of clarity, we will patiently write the names in full, each time...

By internationalization, one refers to the operation by which a program, or a set of programs turned into a package, is made aware of and able to support multiple languages. This is a generalization process, by which the programs are untied from calling only English strings or other English specific habits, and connected to generic ways of doing the same, instead. Program developers may use various techniques to internationalize their programs. Some of these have been standardized. GNU gettext offers one of these standards. See Programmers.

By localization, one means the operation by which, in a set of programs already internationalized, one gives the program all needed information so that it can adapt itself to handle its input and output in a fashion which is correct for some native language and cultural habits. This is a particularisation process, by which generic methods already implemented in an internationalized program are used in specific ways. The programming environment puts several functions at the programmer's disposal which allow this runtime configuration. The formal description of a specific set of cultural habits for some country, together with all associated translations targeted to the same native language, is called the locale for this language or country. Users achieve localization of programs by setting proper values to special environment variables, prior to executing those programs, identifying which locale should be used.

In fact, locale message support is only one component of the cultural data that makes up a particular locale. There are a whole host of routines and functions provided to aid programmers in developing internationalized software and which allow them to access the data stored in a particular locale. When someone presently refers to a particular locale, they are obviously referring to the data stored within that particular locale. Similarly, if a programmer is referring to “accessing the locale routines”, they are referring to the complete suite of routines that access all of the locale's information.

One uses the expression Native Language Support, or merely NLS, for speaking of the overall activity or feature encompassing both internationalization and localization, allowing for multi-lingual interactions in a program. In a nutshell, one could say that internationalization is the operation by which further localizations are made possible.

Also, very roughly said, when it comes to multi-lingual messages, internationalization is usually taken care of by programmers, and localization is usually taken care of by translators.


1.3 Aspects in Native Language Support

For a totally multi-lingual distribution, there are many things to translate beyond output messages.

As we already stressed, translation is only one aspect of locales. Other internationalization aspects are system services and are handled in GNU libc. There are many attributes that are needed to define a country's cultural conventions. These attributes include, besides the country's native language, the formatting of the date and time, the representation of numbers, the symbols for currency, etc. These local rules are termed the country's locale. The locale represents the knowledge needed to support the country's native attributes.

There are a few major areas which may vary between countries and hence define what a locale must describe. The following list helps put multi-lingual messages into the proper context of other tasks related to locales. See the GNU libc manual for details.

Characters and Codesets
The codeset most commonly used throughout the USA and most English-speaking parts of the world is the ASCII codeset. However, there are many characters needed by various locales that are not found within this codeset. The 8-bit ISO 8859-1 codeset has most of the special characters needed to handle the major European languages. However, in many cases, choosing ISO 8859-1 is nevertheless not adequate: it doesn't even handle the major European currency. Hence each locale will need to specify which codeset it needs to use and will need to have the appropriate character handling routines to cope with the codeset.
Currency
The symbols used vary from country to country as does the position used by the symbol. Software needs to be able to transparently display currency figures in the native mode for each locale.
Dates
The format of date varies between locales. For example, Christmas day in 1994 is written as 12/25/94 in the USA and as 25/12/94 in Australia. Other countries might use ISO 8601 dates, etc.

Time of the day may be noted as hh:mm, hh.mm, or otherwise. Some locales require time to be specified in 24-hour mode rather than as AM or PM. Further, the nature and yearly extent of the Daylight Saving correction vary widely between countries.

Numbers
Numbers can be represented differently in different locales. For example, the following numbers are all written correctly for their respective locales:
          12,345.67       English
          12.345,67       German
           12345,67       French
          1,2345.67       Asia

Some programs could go further and use different unit systems, like English units or Metric units, or even take into account variants about how numbers are spelled in full.

Messages
The most obvious area is the language support within a locale. This is where GNU gettext provides the means for developers and users to easily change the language that the software uses to communicate to the user.

These areas of cultural conventions are called locale categories. It is an unfortunate term; locale aspects or locale feature categories would be a better term, because each “locale category” describes an area or task that requires localization. The concrete data that describes the cultural conventions for such an area and for a particular culture is also called a locale category. In this sense, a locale is composed of several locale categories: the locale category describing the codeset, the locale category describing the formatting of numbers, the locale category containing the translated messages, and so on.

Components of locale outside of message handling are standardized in the ISO C standard and the POSIX:2001 standard (also known as the SUSv3 specification). GNU libc fully implements this, and most other modern systems provide more or less reasonable support for at least some of the missing components.
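
As an illustration of the non-message categories, the following is a minimal sketch of how a C program might inspect the numeric conventions of the current locale through the standard <locale.h> interface. The function names are hypothetical and chosen only for this example; only localeconv and snprintf come from ISO C.

```c
#include <locale.h>
#include <stdio.h>

/* Return the decimal point string of the current LC_NUMERIC locale. */
const char *
current_decimal_point (void)
{
  return localeconv ()->decimal_point;
}

/* Return the thousands separator of the current LC_NUMERIC locale.
   The string is empty when the locale does not group digits.  */
const char *
current_thousands_separator (void)
{
  return localeconv ()->thousands_sep;
}

/* Format VALUE with two decimals into BUFFER.  The printf family
   takes the decimal point from the current LC_NUMERIC locale.  */
int
format_amount (char *buffer, size_t size, double value)
{
  return snprintf (buffer, size, "%.2f", value);
}
```

A program starts in the ‘C’ locale, where format_amount produces a period as decimal point. After a call such as setlocale (LC_NUMERIC, "de_DE.UTF-8"), and assuming that locale is installed, the same call would be expected to produce a comma instead.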


1.4 Files Conveying Translations

The letters PO in .po files mean Portable Object, to distinguish them from .mo files, where MO stands for Machine Object. This paradigm, as well as the PO file format, is inspired by the NLS standard developed by Uniforum, and first implemented by Sun in their Solaris system.

PO files are meant to be read and edited by humans, and associate each original, translatable string of a given package with its translation in a particular target language. A single PO file is dedicated to a single target language. If a package supports many languages, there is one such PO file per language supported, and each package has its own set of PO files. These PO files are best created by the xgettext program, and later updated or refreshed through the msgmerge program. The xgettext program extracts all marked messages from a set of C files and initializes a PO file with empty translations. The msgmerge program takes care of adjusting PO files between releases of the corresponding sources, commenting out obsolete entries, initializing new ones, and updating all source line references. Files ending with .pot are a kind of base translation file found in distributions, in PO file format.

MO files are meant to be read by programs, and are binary in nature. A few systems already offer tools for creating and handling MO files as part of the Native Language Support coming with the system, but the format of these MO files is often different from system to system, and non-portable. The tools already provided with these systems don't support all the features of GNU gettext. Therefore GNU gettext uses its own format for MO files. Files ending with .gmo are really MO files, when it is known that these files use the GNU format.


1.5 Overview of GNU gettext

The following diagram summarizes the relation between the files handled by GNU gettext and the tools acting on these files. It is followed by somewhat detailed explanations, which you should read while keeping an eye on the diagram. Having a clear understanding of these interrelations will surely help programmers, translators and maintainers.

     Original C Sources ───> Preparation ───> Marked C Sources ───╮
                                                                  │
                   ╭─────────<─── GNU gettext Library             │
     ╭─── make <───┤                                              │
     │             ╰─────────<────────────────────┬───────────────╯
     │                                            │
     │   ╭─────<─── PACKAGE.pot <─── xgettext <───╯   ╭───<─── PO Compendium
     │   │                                            │              ↑
     │   │                                            ╰───╮          │
     │   ╰───╮                                            ├───> PO editor ───╮
     │       ├────> msgmerge ──────> LANG.po ────>────────╯                  │
     │   ╭───╯                                                               │
     │   │                                                                   │
     │   ╰─────────────<───────────────╮                                     │
     │                                 ├─── New LANG.po <────────────────────╯
     │   ╭─── LANG.gmo <─── msgfmt <───╯
     │   │
     │   ╰───> install ───> /.../LANG/PACKAGE.mo ───╮
     │                                              ├───> "Hello world!"
     ╰───────> install ───> /.../bin/PROGRAM ───────╯

As a programmer, the first step to bringing GNU gettext into your package is identifying, right in the C sources, those strings which are meant to be translatable, and those which are untranslatable. This tedious job can be done a little more comfortably using Emacs PO mode, but you can use any means familiar to you for modifying your C sources. Besides this, some other simple, standard changes are needed to properly initialize the translation library. See Sources, for more information about all this.

For newly written software the strings of course can and should be marked while writing it. The gettext approach makes this very easy. Simply put the following lines at the beginning of each file or in a central header file:

     #define _(String) (String)
     #define N_(String) String
     #define textdomain(Domain)
     #define bindtextdomain(Package, Directory)

Doing this allows you to prepare the sources for internationalization. Later, when you feel ready to take the step of using the gettext library, simply replace these definitions with the following:

     #include <libintl.h>
     #define _(String) gettext (String)
     #define gettext_noop(String) String
     #define N_(String) gettext_noop (String)

and link against libintl.a or libintl.so. Note that on GNU systems, you don't need to link with libintl because the gettext library functions are already contained in GNU libc. That is all you have to change.
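
To show how these definitions fit together, here is a minimal sketch of the initialization and of the two marking macros in use. The domain name "example-pkg", the directory "/usr/local/share/locale" and the function names are assumptions made for this illustration only; a real package would use its own values, normally supplied by the build system.

```c
#include <libintl.h>
#include <locale.h>

#define _(String) gettext (String)
#define gettext_noop(String) String
#define N_(String) gettext_noop (String)

/* Hypothetical values, chosen only for this sketch.  */
#define EXAMPLE_DOMAIN "example-pkg"
#define EXAMPLE_LOCALEDIR "/usr/local/share/locale"

/* A static initializer cannot call gettext, so these strings are
   only marked with N_ and translated later, at the point of use.  */
static const char *const status_names[] = { N_("ready"), N_("busy") };

/* Select the locale from the environment and bind the message domain.  */
void
init_i18n (void)
{
  setlocale (LC_ALL, "");
  bindtextdomain (EXAMPLE_DOMAIN, EXAMPLE_LOCALEDIR);
  textdomain (EXAMPLE_DOMAIN);
}

/* Return the translated name of a status; INDEX must be 0 or 1.
   Without a matching catalog, the English string is returned.  */
const char *
status_name (int index)
{
  return gettext (status_names[index]);
}

/* A string that is marked and translated in one step.  */
const char *
greeting (void)
{
  return _("Hello, world!");
}
```

A program would call init_i18n once at startup, before the first message is produced. As long as no message catalog for the domain is installed, greeting and status_name simply return the original English strings.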

Once the C sources have been modified, the xgettext program is used to find and extract all translatable strings, and create a PO template file out of all these. This package.pot file contains all original program strings. It has sets of pointers to exactly where in C sources each string is used. All translations are set to empty. The letter t in .pot marks this as a Template PO file, not yet oriented towards any particular language. See xgettext Invocation, for more details about how one calls the xgettext program. If you are really lazy, you might be interested in working a lot more right away, and preparing the whole distribution setup (see Maintainers). By doing so, you spare yourself typing the xgettext command, as make should now generate the proper things automatically for you!

The first time through, there is no lang.po yet, so the msgmerge step may be skipped and replaced by a mere copy of package.pot to lang.po, where lang represents the target language. See Creating for details.

Then comes the initial translation of messages. Translation in itself is a whole matter, still exclusively meant for humans, and whose complexity far overwhelms the level of this manual. Nevertheless, a few hints are given in some other chapter of this manual (see Translators). You will also find there indications about how to contact translating teams, or becoming part of them, for sharing your translating concerns with others who target the same native language.

While adding the translated messages into the lang.po PO file, if you are not using one of the dedicated PO file editors (see Editing), you are on your own for ensuring that your efforts fully respect the PO file format, and quoting conventions (see PO Files). This is surely not an impossible task, as this is the way many people handled PO files around 1995. On the other hand, by using a PO file editor, most details of PO file format are taken care of for you, but you have to acquire some familiarity with the PO file editor itself.

If some common translations have already been saved into a compendium PO file, translators may use PO mode for initializing untranslated entries from the compendium, and also save selected translations into the compendium, updating it (see Compendium). Compendium files are meant to be exchanged between members of a given translation team.

Programs, or packages of programs, are dynamic in nature: users write bug reports and suggestions for improvements, maintainers react by modifying programs in various ways. The fact that a package has already been internationalized should not make maintainers shy of adding new strings, or modifying strings already translated. They just do their job the best they can. For the Translation Project to work smoothly, it is important that maintainers do not carry translation concerns on their already loaded shoulders, and that translators be kept as free as possible of programming concerns.

The only concern maintainers should have is carefully marking new strings as translatable, when they should be, without otherwise worrying about them being translated, as this will come in proper time. Consequently, when programs and their strings are adjusted in various ways by maintainers, and for matters usually unrelated to translation, xgettext constructs package.pot files which evolve over time, so the translations carried by lang.po slowly fall out of date.

It is important for translators (and even maintainers) to understand that package translation is a continuous process in the lifetime of a package, and not something which is done once and for all at the start. After an initial burst of translation activity for a given package, interventions are needed once in a while, because here and there, translated entries become obsolete, and new untranslated entries appear, needing translation.

The msgmerge program has the purpose of refreshing an already existing lang.po file, by comparing it with a newer package.pot template file, extracted by xgettext out of recent C sources. The refreshing operation adjusts all references to C source locations for strings, since these strings move as programs are modified. Also, msgmerge comments out as obsolete, in lang.po, those already translated entries which are no longer used in the program sources (see Obsolete Entries). It finally discovers new strings and inserts them in the resulting PO file as untranslated entries (see Untranslated Entries). See msgmerge Invocation, for more information about what msgmerge really does.

Whatever route or means taken, the goal is to obtain an updated lang.po file offering translations for all strings.

The temporal mobility, or fluidity of PO files, is an integral part of the translation game, and should be well understood, and accepted. People resisting it will have a hard time participating in the Translation Project, or will give a hard time to other participants! In particular, maintainers should relax and include all available official PO files in their distributions, even if these have not recently been updated, without exerting pressure on the translator teams to get the job done. The pressure should rather come from the community of users speaking a particular language, and maintainers should consider themselves fairly relieved of any concern about the adequacy of translation files. On the other hand, translators should reasonably try updating the PO files they are responsible for, while the package is undergoing pretest, prior to an official distribution.

Once the PO file is complete and dependable, the msgfmt program is used for turning the PO file into a machine-oriented format, which may yield efficient retrieval of translations by the programs of the package, whenever needed at runtime (see MO Files). See msgfmt Invocation, for more information about all modes of execution for the msgfmt program.

Finally, the modified and marked C sources are compiled and linked with the GNU gettext library, usually through the operation of make, given a suitable Makefile exists for the project, and the resulting executable is installed somewhere users will find it. The MO files themselves should also be properly installed. Given the appropriate environment variables are set (see Setting the POSIX Locale), the program should localize itself automatically, whenever it executes.

The remainder of this manual has the purpose of explaining in depth the various steps outlined above.


2 The User's View

Nowadays, when users log into a computer, they usually find that all their programs show messages in their native language – at least for users of languages with an active free software community, like French or German; to a lesser extent for languages with a smaller participation in free software and the GNU project, like Hindi and Filipino.

How does this work? How can the user influence the language that is used by the programs? This chapter answers these questions.


2.1 Operating System Installation

The default language is often already specified during operating system installation. When the operating system is installed, the installer typically asks for the language used for the installation process and, separately, for the language to use in the installed system. Some OS installers only ask for the language once.

This determines the system-wide default language for all users. But the installers often give the possibility to install extra localizations for additional languages. For example, the localizations of KDE (the K Desktop Environment) and OpenOffice.org are often bundled separately, as one installable package per language.

At this point it is good to consider the intended use of the machine: If it is a machine designated for personal use, additional localizations are probably not necessary. If, however, the machine is in use in an organization or company that has international relationships, one can consider the needs of guest users. If you have a guest from abroad, for a week, what could be his preferred locales? It may be worth installing these additional localizations ahead of time, since they cost only a bit of disk space at this point.

The system-wide default language is the locale configuration that is used when a new user account is created. But the user can have his own locale configuration that is different from the one of the other users of the same machine. He can specify it, typically after the first login, as described in the next section.


2.2 Setting the Locale Used by GUI Programs

The immediately available programs in a user's desktop come from a group of programs called a “desktop environment”; it usually includes the window manager, a web browser, a text editor, and more. The most common free desktop environments are KDE, GNOME, and Xfce.

The locale used by GUI programs of the desktop environment can be specified in a configuration screen called “control center”, “language settings” or “country settings”.

Individual GUI programs that are not part of the desktop environment can have their locale specified either in a settings panel, or through environment variables.

For some programs, it is possible to specify the locale through environment variables, possibly even to a different locale than the desktop's locale. This means, instead of starting a program through a menu or from the file system, you can start it from the command-line, after having set some environment variables. The environment variables can be those specified in the next section (Setting the POSIX Locale); for some versions of KDE, however, the locale is specified through a variable KDE_LANG, rather than LANG or LC_ALL.


2.3 Setting the Locale through Environment Variables

As a user, if your language has been installed for this package, in the simplest case, you only have to set the LANG environment variable to the appropriate ‘ll_CC’ combination. For example, let's suppose that you speak German and live in Germany. At the shell prompt, merely execute ‘setenv LANG de_DE’ (in csh), ‘export LANG; LANG=de_DE’ (in sh) or ‘export LANG=de_DE’ (in bash). This can be done from your .login or .profile file, once and for all.


2.3.1 Locale Names

A locale name usually has the form ‘ll_CC’. Here ‘ll’ is an ISO 639 two-letter language code, and ‘CC’ is an ISO 3166 two-letter country code. For example, for German in Germany, ll is de, and CC is DE. You find a list of the language codes in appendix Language Codes and a list of the country codes in appendix Country Codes.

You might think that the country code specification is redundant. But in fact, some languages have dialects in different countries. For example, ‘de_AT’ is used for Austria, and ‘pt_BR’ for Brazil. The country code serves to distinguish the dialects.

Many locale names have an extended syntax ‘ll_CC.encoding’ that also specifies the character encoding. These are in use because between 2000 and 2005, most users switched to locales in UTF-8 encoding. For example, the German locale on glibc systems is nowadays ‘de_DE.UTF-8’. The older name ‘de_DE’ still refers to the German locale as of 2000 that stores characters in ISO-8859-1 encoding – a text encoding that cannot even accommodate the Euro currency sign.

Some locale names use ‘ll_CC@variant’ instead of ‘ll_CC’. The ‘@variant’ can denote any kind of characteristic that is not already implied by the language ll and the country CC. It can denote a particular monetary unit. For example, on glibc systems, ‘de_DE@euro’ denotes the locale that uses the Euro currency, in contrast to the older locale ‘de_DE’ which implies the use of the currency before 2002. It can also denote a dialect of the language, or the script used to write text (for example, ‘sr_RS@latin’ uses the Latin script, whereas ‘sr_RS’ uses the Cyrillic script to write Serbian), or the orthography rules, or similar.

On other systems, some variations of this scheme are used, such as ‘ll’. You can get the list of locales supported by your system for your language by running the command ‘locale -a | grep '^ll'’.

There is also a special locale, called ‘C’. When it is used, it disables all localization: in this locale, all programs standardized by POSIX use English messages and an unspecified character encoding (often US-ASCII, but sometimes also ISO-8859-1 or UTF-8, depending on the operating system).
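
The naming scheme described in this section can be made concrete with a small parser. The following is only a sketch: the structure, the function names and the buffer sizes are hypothetical, and it assumes the common ordering in which the language comes first, followed by an optional ‘_CC’, an optional ‘.encoding’, and an optional ‘@variant’.

```c
#include <string.h>

/* Components of a locale name; an empty string means "not present".  */
struct locale_parts
{
  char language[16];
  char country[16];
  char encoding[32];
  char variant[32];
};

/* Copy the first LEN bytes of SRC into DST, which has room for CAP
   bytes, truncating if necessary and always terminating the string.  */
static void
copy_part (char *dst, size_t cap, const char *src, size_t len)
{
  if (len >= cap)
    len = cap - 1;
  memcpy (dst, src, len);
  dst[len] = '\0';
}

/* Split NAME, e.g. "de_DE.UTF-8" or "sr_RS@latin", into its parts.  */
void
parse_locale_name (const char *name, struct locale_parts *out)
{
  size_t total = strlen (name);
  size_t at = strcspn (name, "@");    /* start of "@variant", or end */
  size_t dot = strcspn (name, ".@");  /* end of the "ll_CC" part */
  size_t us = strcspn (name, "_.@");  /* end of the "ll" part */

  memset (out, 0, sizeof *out);
  copy_part (out->language, sizeof out->language, name, us);
  if (name[us] == '_')
    copy_part (out->country, sizeof out->country,
               name + us + 1, dot - us - 1);
  if (name[dot] == '.')
    copy_part (out->encoding, sizeof out->encoding,
               name + dot + 1, at - dot - 1);
  if (name[at] == '@')
    copy_part (out->variant, sizeof out->variant,
               name + at + 1, total - at - 1);
}
```

For ‘de_DE.UTF-8’ this yields the language ‘de’, the country ‘DE’ and the encoding ‘UTF-8’; for ‘sr_RS@latin’ it yields the variant ‘latin’ and an empty encoding. The special name ‘C’ is returned as a language with all other parts empty.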


2.3.2 Locale Environment Variables

A locale is composed of several locale categories, see Aspects. When a program looks up locale dependent values, it does this according to the following environment variables, in priority order:

  1. LANGUAGE
  2. LC_ALL
  3. LC_xxx, according to selected locale category: LC_CTYPE, LC_NUMERIC, LC_TIME, LC_COLLATE, LC_MONETARY, LC_MESSAGES, ...
  4. LANG

Variables whose value is set but is empty are ignored in this lookup.

LANG is the normal environment variable for specifying a locale. As a user, you normally set this variable (unless some of the other variables have already been set by the system, in /etc/profile or similar initialization files).

LC_CTYPE, LC_NUMERIC, LC_TIME, LC_COLLATE, LC_MONETARY, LC_MESSAGES, and so on, are the environment variables meant to override LANG and affect a single locale category only. For example, assume you are a Swedish user in Spain, and you want your programs to handle numbers and dates according to Spanish conventions, and only the messages should be in Swedish. Then you could create a locale named ‘sv_ES’ or ‘sv_ES.UTF-8’ by use of the localedef program. But it is simpler, and achieves the same effect, to set the LANG variable to es_ES.UTF-8 and the LC_MESSAGES variable to sv_SE.UTF-8; these two locales come already preinstalled with the operating system.

LC_ALL is an environment variable that overrides all of these. It is typically used in scripts that run particular programs. For example, configure scripts generated by GNU autoconf use LC_ALL to make sure that the configuration tests don't operate in locale dependent ways.

Some systems, unfortunately, set LC_ALL in /etc/profile or in similar initialization files. As a user, you therefore have to unset this variable if you want to set LANG and optionally some of the other LC_xxx variables.
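
Leaving LANGUAGE aside, the lookup order for a single locale category can be sketched in C as follows. This is an illustration of the rules stated above, not the code used by any particular C library; the function names are hypothetical, and the final fallback to ‘C’ when nothing is set is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdlib.h>

/* A variable that is unset (NULL) or empty is ignored.  */
static int
is_set (const char *value)
{
  return value != NULL && value[0] != '\0';
}

/* Pick the locale name for one category from the values of LC_ALL,
   the category's own LC_xxx variable, and LANG, in that order.  */
const char *
choose_locale (const char *lc_all, const char *lc_category,
               const char *lang)
{
  if (is_set (lc_all))
    return lc_all;
  if (is_set (lc_category))
    return lc_category;
  if (is_set (lang))
    return lang;
  return "C";  /* assumption: nothing set means the "C" locale */
}

/* Same lookup, reading the process environment.  CATEGORY_VARIABLE
   is a name such as "LC_MESSAGES" or "LC_NUMERIC".  */
const char *
effective_locale (const char *category_variable)
{
  return choose_locale (getenv ("LC_ALL"), getenv (category_variable),
                        getenv ("LANG"));
}
```

With LANG set to es_ES.UTF-8 and LC_MESSAGES set to sv_SE.UTF-8, as in the Swedish example above, effective_locale ("LC_MESSAGES") returns the Swedish locale while effective_locale ("LC_NUMERIC") returns the Spanish one. If LC_ALL is set to a non-empty value, it wins in both cases.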

The LANGUAGE variable is described in the next subsection.


2.3.3 Specifying a Priority List of Languages

Not all programs have translations for all languages. By default, an English message is shown in place of a nonexistent translation. If you understand other languages, you can set up a priority list of languages. This is done through a different environment variable, called LANGUAGE. GNU gettext gives preference to LANGUAGE over LC_ALL and LANG for the purpose of message handling, but you still need to have LANG (or LC_ALL) set to the primary language; this is required by other parts of the system libraries. For example, some Swedish users who would rather read translations in German than English when Swedish is not available, set LANGUAGE to ‘sv:de’ while leaving LANG set to ‘sv_SE’.

Special advice for Norwegian users: The language code for Norwegian bokmål changed from ‘no’ to ‘nb’ recently (in 2003). During the transition period, while some message catalogs for this language are installed under ‘nb’ and some older ones under ‘no’, it is recommended for Norwegian users to set LANGUAGE to ‘nb:no’ so that both newer and older translations are used.

In the LANGUAGE environment variable, but not in the other environment variables, ‘ll_CC’ combinations can be abbreviated as ‘ll’ to denote the language's main dialect. For example, ‘de’ is equivalent to ‘de_DE’ (German as spoken in Germany), and ‘pt’ to ‘pt_PT’ (Portuguese as spoken in Portugal) in this context.

Note: The variable LANGUAGE is ignored if the locale is set to ‘C’. In other words, you have to first enable localization, by setting LANG (or LC_ALL) to a value other than ‘C’, before you can use a language priority list through the LANGUAGE variable.
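
As a rough illustration of how such a priority list is read, the following C sketch splits a LANGUAGE value at the colons and yields nothing when the locale is ‘C’. It is not the code used by GNU gettext; the function name language_priority, the 32-byte entry size, and the decision to skip overlong entries silently are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Splits a LANGUAGE value such as "sv:de" into its entries, in priority
   order.  Stores at most MAX entries in OUT and returns how many were
   stored.  Returns 0 when LOCALE is "C", because LANGUAGE is ignored
   in that case.  Empty entries and entries longer than 31 characters
   are skipped.  Hypothetical helper, for illustration only.  */
size_t
language_priority (const char *language, const char *locale,
                   char out[][32], size_t max)
{
  size_t count = 0;

  if (language == NULL || locale == NULL || strcmp (locale, "C") == 0)
    return 0;

  while (*language != '\0' && count < max)
    {
      size_t len = strcspn (language, ":");
      if (len > 0 && len < 32)
        {
          memcpy (out[count], language, len);
          out[count][len] = '\0';
          count++;
        }
      language += len;
      if (*language == ':')
        language++;
    }
  return count;
}
```

For the Swedish example above, a LANGUAGE value of "sv:de" with the locale "sv_SE" gives the two entries "sv" and "de", in that order; the same value with the locale "C" gives no entries at all.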


2.4 Installing Translations for Particular Programs

Languages are not equally well supported in all packages using GNU gettext, and more translations are added over time. Usually, you use the translations that are shipped with the operating system or with particular packages that you install afterwards. But you can also install newer localizations directly. To do this, you need to understand where each localization file is stored on the file system.

For programs that participate in the Translation Project, you can start looking for translations here: http://translationproject.org/team/index.html. A snapshot of this information is also found in the ABOUT-NLS file that is shipped with GNU gettext.

For programs that are part of the KDE project, the starting point is: http://i18n.kde.org/.

For programs that are part of the GNOME project, the starting point is: http://www.gnome.org/i18n/.

For other programs, you may check whether the program's source code package contains some ll.po files; often they are kept together in a directory called po/. Each ll.po file contains the message translations for the language whose abbreviation is ll.


3 The Format of PO Files

The GNU gettext toolset helps programmers and translators in producing, updating and using translation files, mainly those PO files which are textual, editable files. This chapter explains the format of PO files.

A PO file is made up of many entries, each entry holding the relation between an original untranslated string and its corresponding translation. All entries in a given PO file usually pertain to a single project, and all translations are expressed in a single target language. One PO file entry has the following schematic structure:

     white-space
     #  translator-comments
     #. extracted-comments
     #: reference...
     #, flag...
     #| msgid previous-untranslated-string
     msgid untranslated-string
     msgstr translated-string

The general structure of a PO file should be well understood by the translator. When using PO mode, very little has to be known about the format details, as PO mode takes care of them for her.

A simple entry can look like this:

     #: lib/error.c:116
     msgid "Unknown system error"
     msgstr "Error desconegut del sistema"

Entries begin with some optional white space. Usually, when generated through GNU gettext tools, there is exactly one blank line between entries. Then comments follow, on lines all starting with the character #. There are two kinds of comments: those which have some white space immediately following the # (the translator comments), which are created and maintained exclusively by the translator, and those which have some non-white character just after the # (the automatic comments), which are created and maintained automatically by GNU gettext tools. Comment lines starting with #. contain comments given by the programmer, directed at the translator; these comments are called extracted comments because the xgettext program extracts them from the program's source code. Comment lines starting with #: contain references to the program's source code. Comment lines starting with #, contain flags; more about these below. Comment lines starting with #| contain the previous untranslated string for which the translator gave a translation.

All comments, of either kind, are optional.
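
The comment kinds listed above can be told apart by looking at the character that follows the #. The C sketch below shows one way to do so; the enumeration and the function classify_po_line are hypothetical names, and treating a bare # line as a translator comment is an assumption of this example rather than something stated in this chapter.

```c
#include <ctype.h>

/* The kinds of PO file lines distinguished by this sketch.  */
enum po_comment_kind
{
  PO_NOT_COMMENT,       /* line does not start with '#' */
  PO_TRANSLATOR,        /* "# "  translator comment */
  PO_EXTRACTED,         /* "#."  extracted comment */
  PO_REFERENCE,         /* "#:"  source reference */
  PO_FLAGS,             /* "#,"  flag list */
  PO_PREVIOUS,          /* "#|"  previous untranslated string */
  PO_OTHER_AUTOMATIC    /* '#' followed by another non-white character */
};

/* Classifies one line of a PO file by the character after the '#'.
   Hypothetical helper, for illustration only.  */
enum po_comment_kind
classify_po_line (const char *line)
{
  if (line[0] != '#')
    return PO_NOT_COMMENT;

  switch (line[1])
    {
    case '.':
      return PO_EXTRACTED;
    case ':':
      return PO_REFERENCE;
    case ',':
      return PO_FLAGS;
    case '|':
      return PO_PREVIOUS;
    case '\0':
      /* Assumption: a bare "#" counts as an (empty) translator comment.  */
      return PO_TRANSLATOR;
    default:
      return isspace ((unsigned char) line[1])
             ? PO_TRANSLATOR
             : PO_OTHER_AUTOMATIC;
    }
}
```

Applied to the simple entry shown earlier, the line "#: lib/error.c:116" is classified as a reference, while the msgid and msgstr lines are not comments at all.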

After white space and comments, entries show two strings, namely first the untranslated string as it appears in the original program sources, and then, the translation of this string. The original string is introduced by the keyword msgid, and the translation, by msgstr. The two strings, untranslated and translated, are quoted in various ways in the PO file, using " delimiters and \ escapes, but the translator does not really have to pay attention to the precise quoting format, as PO mode fully takes care of quoting for her.

The msgid strings, as well as automatic comments, are produced and managed by other GNU gettext tools, and PO mode does not provide means for the translator to alter these. The most she can do is delete them, and only by deleting the whole entry. On the other hand, the msgstr string, as well as translator comments, are really meant for the translator, and PO mode gives her the full control she needs.

The comment lines beginning with #, are special because they are not completely ignored by the programs as comments generally are. The comma separated list of flags is used by the msgfmt program to give the user some better diagnostic messages. Currently there are two forms of flags defined:

fuzzy
This flag can be generated by the msgmerge program or it can be inserted by the translator herself. It shows that the msgstr string might not be a correct translation (anymore). Only the translator can judge if the translation requires further modification, or is acceptable as is. Once satisfied with the translation, she then removes this fuzzy attribute. The msgmerge program inserts this when it combined the msgid and msgstr entries after fuzzy search only. See Fuzzy Entries.
c-format
no-c-format
These flags should not be added by a human. Instead only the xgettext program adds them. In an automated PO file processing system as proposed here the user changes would be thrown away again as soon as the xgettext program generates a new template file.

The c-format flag tells that the untranslated string and the translation are supposed to be C format strings. The no-c-format flag tells that they are not C format strings, even though the untranslated string happens to look like a C format string (with ‘%’ directives).

In case the c-format flag is given for a string, the msgfmt program does some more tests to check the validity of the translation. See msgfmt Invocation, c-format Flag and c-format.

objc-format
no-objc-format
Likewise for Objective C, see objc-format.
sh-format
no-sh-format
Likewise for Shell, see sh-format.
python-format
no-python-format
Likewise for Python, see python-format.
lisp-format
no-lisp-format
Likewise for Lisp, see lisp-format.
elisp-format
no-elisp-format
Likewise for Emacs Lisp, see elisp-format.
librep-format
no-librep-format
Likewise for librep, see librep-format.
scheme-format
no-scheme-format
Likewise for Scheme, see scheme-format.
smalltalk-format
no-smalltalk-format
Likewise for Smalltalk, see smalltalk-format.
java-format
no-java-format
Likewise for Java, see java-format.
csharp-format
no-csharp-format
Likewise for C#, see csharp-format.
awk-format
no-awk-format
Likewise for awk, see awk-format.
object-pascal-format
no-object-pascal-format
Likewise for Object Pascal, see object-pascal-format.
ycp-format
no-ycp-format
Likewise for YCP, see ycp-format.
tcl-format
no-tcl-format
Likewise for Tcl, see tcl-format.
perl-format
no-perl-format
Likewise for Perl, see perl-format.
perl-brace-format
no-perl-brace-format
Likewise for Perl brace, see perl-format.
php-format
no-php-format
Likewise for PHP, see php-format.
gcc-internal-format
no-gcc-internal-format
Likewise for the GCC sources, see gcc-internal-format.
qt-format
no-qt-format
Likewise for Qt, see qt-format.
kde-format
no-kde-format
Likewise for KDE, see kde-format.
boost-format
no-boost-format
Likewise for Boost, see boost-format.
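
To give an idea of why the c-format flag described above matters, here is a deliberately simplified C sketch of the kind of consistency test that can be applied to a c-format entry: it compares the sequence of conversion characters in the untranslated string and in the translation. It is not the algorithm used by msgfmt. The function names are hypothetical, and the sketch assumes that directives are not reordered with positional arguments such as %1$s, which it does not handle.

```c
#include <stddef.h>
#include <string.h>

/* Collects the conversion characters of the printf-style directives in
   FMT into OUT (a buffer of OUT_SIZE >= 1 bytes), skipping "%%".
   Flags, width, precision and length modifiers are skipped over.
   Hypothetical helper, for illustration only.  */
static void
collect_conversions (const char *fmt, char *out, size_t out_size)
{
  size_t n = 0;

  while (*fmt != '\0')
    {
      if (*fmt++ != '%')
        continue;
      if (*fmt == '%')
        {
          /* "%%" stands for a literal percent sign.  */
          fmt++;
          continue;
        }
      fmt += strspn (fmt, "0123456789-+ #.*'hlLqjzt");
      if (*fmt == '\0')
        break;
      if (n + 1 < out_size)
        out[n++] = *fmt;
      fmt++;
    }
  out[n] = '\0';
}

/* Returns 1 if MSGID and MSGSTR use the same conversions in the same
   order, 0 otherwise.  */
int
same_conversions (const char *msgid, const char *msgstr)
{
  char a[64];
  char b[64];

  collect_conversions (msgid, a, sizeof a);
  collect_conversions (msgstr, b, sizeof b);
  return strcmp (a, b) == 0;
}
```

A translation that turns "%s: %d" into "%d: %s" fails this comparison, which is the sort of mistake that could make the program crash when the translated format string is passed to printf.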

It is also possible to have entries with a context specifier. They look like this:

     white-space
     #  translator-comments
     #. extracted-comments
     #: reference...
     #, flag...
     #| msgctxt previous-context
     #| msgid previous-untranslated-string
     msgctxt context
     msgid untranslated-string
     msgstr translated-string

The context serves to disambiguate messages with the same untranslated-string. It is possible to have several entries with the same untranslated-string in a PO file, provided that they each have a different context. Note that an empty context string and an absent msgctxt line do not mean the same thing.

A different kind of entries is used for translations which involve plural forms.

     white-space
     #  translator-comments
     #. extracted-comments
     #: reference...
     #, flag...
     #| msgid previous-untranslated-string-singular
     #| msgid_plural previous-untranslated-string-plural
     msgid untranslated-string-singular
     msgid_plural untranslated-string-plural
     msgstr[0] translated-string-case-0
     ...
     msgstr[N] translated-string-case-n

Such an entry can look like this:

     #: src/msgcmp.c:338 src/po-lex.c:699
     #, c-format
     msgid "found %d fatal error"
     msgid_plural "found %d fatal errors"
     msgstr[0] "s'ha trobat %d error fatal"
     msgstr[1] "s'han trobat %d errors fatals"

Here also, a msgctxt context can be specified before msgid, like above.
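
In C sources, a plural entry like the one above typically comes from a call to the ngettext function declared in <libintl.h>. The sketch below shows such a call; the wrapper format_error_count is a hypothetical name, the count is assumed to be non-negative, and on systems where libintl is not part of the C library the program may need to be linked with -lintl.

```c
#include <libintl.h>
#include <stdio.h>

/* Writes the message for N fatal errors into BUF (of SIZE bytes) and
   returns the value of snprintf.  N is assumed to be non-negative.
   ngettext picks the translation matching N; when no catalog is in use
   it returns the first string for N == 1 and the second one otherwise.
   Hypothetical helper, for illustration only.  */
int
format_error_count (char *buf, size_t size, int n)
{
  const char *format = ngettext ("found %d fatal error",
                                 "found %d fatal errors",
                                 (unsigned long) n);
  return snprintf (buf, size, format, n);
}
```

Note that the count is passed twice: once to ngettext, to select the plural form, and once to snprintf, to fill in the %d directive.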

The previous-untranslated-string is optionally inserted by the msgmerge program, at the same time as it marks a message fuzzy. It helps the translator to see which changes were made by the developers to the untranslated-string.

Sometimes some lines, usually whitespace or comments, follow the very last entry of a PO file. Such lines are not part of any entry; they will be dropped when the PO file is processed by the tools, and they may disturb some PO file editors.

The remainder of this section may be safely skipped by those using a PO file editor, yet it may be interesting for everybody to have a better idea of the precise format of a PO file. On the other hand, those wishing to modify PO files by hand should continue reading carefully.

Each of untranslated-string and translated-string respects the C syntax for a character string, including the surrounding quotes and embedded backslashed escape sequences. When the time comes to write multi-line strings, one should not use escaped newlines. Instead, a closing quote should follow the last character on the line to be continued, and an opening quote should resume the string at the beginning of the following PO file line. For example:

     msgid ""
     "Here is an example of how one might continue a very long string\n"
     "for the common case the string represents multi-line output.\n"

In this example, the empty string is used on the first line, to allow better alignment of the H from the word ‘Here’ over the f from the word ‘for’. The msgid keyword is thus followed by three strings, which are meant to be concatenated. Concatenating the empty string does not change the resulting overall string, but it is a way for us to comply with the necessity of msgid to be followed by a string on the same line, while keeping the multi-line presentation left-justified, as we find this to be a cleaner disposition. The empty string could have been omitted, but only if the string starting with ‘Here’ was moved up to the first line, right after msgid. It was not really necessary either to switch between the two last quoted strings immediately after the newline ‘\n’; the switch could have occurred after any other character. We just did it this way because it is neater.

One should carefully distinguish between end of lines marked as ‘\n’ inside quotes, which are part of the represented string, and end of lines in the PO file itself, outside string quotes, which have no effect on the represented string.
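
Since PO strings follow the C syntax, the same behaviour can be observed in C itself, where adjacent string literals are concatenated by the compiler. The following sketch, with hypothetical variable and function names, compares the three-piece form used in the example above with the equivalent single literal.

```c
#include <string.h>

/* The string written in three pieces, as in the PO example: an empty
   string followed by two quoted lines.  The line breaks between the
   pieces are not part of the string; only the "\n" escapes are.  */
static const char continued[] =
  ""
  "Here is an example of how one might continue a very long string\n"
  "for the common case the string represents multi-line output.\n";

/* The same string written as one literal.  */
static const char single[] =
  "Here is an example of how one might continue a very long string\nfor the common case the string represents multi-line output.\n";

/* Returns 1 if both spellings denote the same string.  */
int
continued_equals_single (void)
{
  return strcmp (continued, single) == 0;
}
```

Both spellings denote exactly the same string, containing two newline characters.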

Outside strings, white lines and comments may be used freely. Comments start at the beginning of a line with ‘#’ and extend until the end of the PO file line. Comments written by translators should have the initial ‘#’ immediately followed by some white space. If the ‘#’ is not immediately followed by white space, this comment is most likely generated and managed by specialized GNU tools, and might disappear or be replaced unexpectedly when the PO file is given to msgmerge.


4 Preparing Program Sources

For the programmer, changes to the C source code fall into three categories. First, you have to make the localization functions known to all modules needing message translation. Second, you should properly trigger the operation of GNU gettext when the program initializes, usually from the main function. Last, you should identify, adjust and mark all constant strings in your program needing translation.


4.1 Importing the gettext declaration

Presuming that your set of programs, or package, has been adjusted so all needed GNU gettext files are available, and your Makefile files are adjusted (see Maintainers), each C module having translated C strings should contain the line:

     #include <libintl.h>

Similarly, each C module containing printf()/fprintf()/... calls with a format string that could be a translated C string (even if the C string comes from a different C module) should contain the line:

     #include <libintl.h>


4.2 Triggering gettext Operations

The initialization of locale data should be done with more or less the same code in every program, as demonstrated below:

     int
     main (int argc, char *argv[])
     {
       ...
       setlocale (LC_ALL, "");
       bindtextdomain (PACKAGE, LOCALEDIR);
       textdomain (PACKAGE);
       ...
     }

PACKAGE and LOCALEDIR should be provided either by config.h or by the Makefile. For now consult the gettext or hello sources for more information.
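
For experimenting outside of a configured package, the initialization can be wrapped as in the following sketch. The fallback values given here for PACKAGE and LOCALEDIR are placeholders chosen for this example only, as is the function name init_i18n; in a real package both macros come from config.h or from the Makefile, as said above.

```c
#include <libintl.h>
#include <locale.h>

/* Placeholder values, used only when the build system defines nothing.
   They are assumptions made for this example.  */
#ifndef PACKAGE
# define PACKAGE "hello"
#endif
#ifndef LOCALEDIR
# define LOCALEDIR "/usr/local/share/locale"
#endif

/* Performs the three initialization calls shown above and returns the
   name of the text domain now in effect, or NULL on failure.
   Hypothetical helper, for illustration only.  */
const char *
init_i18n (void)
{
  setlocale (LC_ALL, "");
  bindtextdomain (PACKAGE, LOCALEDIR);
  return textdomain (PACKAGE);
}
```

Calling this function once, early in main, is enough; later calls to gettext then look up messages in the catalogs of the PACKAGE domain below LOCALEDIR.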

The use of LC_ALL might not be appropriate for you. LC_ALL includes all locale categories and especially LC_CTYPE. This latter category determines the character classes reported by isalnum and the related functions from ctype.h, which could give wrong results, especially for programs that process some kind of input language. For example, it would mean that source code using the ç (c-cedilla) character is accepted when the program runs in France but not in the U.S.

Some systems also have problems parsing numbers with the scanf functions when a locale other than "C" is in effect. The standards say that additional formats besides the one known in the "C" locale might be recognized. But some systems seem to reject numbers in the "C" locale format. In some situations, the notation itself can also make it impossible to tell whether a number is written in the "C" locale format or in the local format. This can happen if thousands separator characters are used: some locales define this character, according to national conventions, to be '.', which is the same character the "C" locale uses to denote the decimal point.

So it is sometimes necessary to replace the LC_ALL line in the code above by a sequence of setlocale lines:

     {
       ...
       setlocale (LC_CTYPE, "");
       setlocale (LC_MESSAGES, "");
       ...
     }

On all POSIX conformant systems the locale categories LC_CTYPE, LC_MESSAGES, LC_COLLATE, LC_MONETARY, LC_NUMERIC, and LC_TIME are available. On some systems which are only ISO C compliant, LC_MESSAGES is missing, but a substitute for it is defined in GNU gettext's <libintl.h> and in GNU gnulib's <locale.h>.
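
The effect of setting only some categories can be checked at run time. In the sketch below, whose function name is hypothetical, only LC_CTYPE and LC_MESSAGES are taken from the environment, and the function then queries LC_NUMERIC, which is expected to have stayed in the "C" locale because every category starts out as "C" at program startup. The sketch assumes a system on which <locale.h> defines LC_MESSAGES.

```c
#include <locale.h>

/* Takes only LC_CTYPE and LC_MESSAGES from the environment, then
   reports whether LC_NUMERIC is still the "C" locale (1) or not (0).
   Assumes that no other setlocale call was made before.
   Hypothetical helper, for illustration only.  */
int
setup_partial_locale (void)
{
  setlocale (LC_CTYPE, "");
  setlocale (LC_MESSAGES, "");

  /* Passing NULL queries the current setting without changing it.  */
  const char *numeric = setlocale (LC_NUMERIC, NULL);
  return numeric != NULL && numeric[0] == 'C' && numeric[1] == '\0';
}
```

With this setup, functions such as printf and scanf keep using the "C" locale's decimal point, whatever the user's environment says.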

Note that changing the LC_CTYPE also affects the functions declared in the <ctype.h> standard header and some functions declared in the <string.h> and <stdlib.h> standard headers. If this is not desirable in your application (for example in a compiler's parser), you can use a set of substitute functions which hardwire the C locale, such as found in the modules ‘c-ctype’, ‘c-strcase’, ‘c-strcasestr’, ‘c-strtod’, ‘c-strtold’ in the GNU gnulib source distribution.
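
To illustrate what hardwiring the C locale means, here is a minimal sketch of a locale-independent replacement for isalnum. It is not the gnulib ‘c-ctype’ code; the name ascii_isalnum is hypothetical, and the sketch assumes an ASCII-compatible execution character set, in which the letters of each case are contiguous.

```c
#include <stdbool.h>

/* Returns true if C is an ASCII letter or digit, regardless of the
   current LC_CTYPE setting.  Assumes an ASCII-compatible character set.
   Hypothetical helper, for illustration only.  */
bool
ascii_isalnum (int c)
{
  return (c >= '0' && c <= '9')
         || (c >= 'a' && c <= 'z')
         || (c >= 'A' && c <= 'Z');
}
```

Unlike isalnum, this function gives the same answer for the byte 0xE7 (the ç of ISO-8859-1) in every locale, so a parser built on it accepts the same input in France and in the U.S.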

It is also possible to switch the locale back and forth between the environment dependent locale and the C locale, but this approach is normally avoided because a setlocale call is expensive, because it is tedious to determine the places where a locale switch is needed in a large program's source, and because switching a locale is not multithread-safe.
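
For completeness, the switching approach looks roughly like the following sketch, which parses a number in the C locale and then restores the previous LC_NUMERIC setting. The function name strtod_in_c_locale is hypothetical. The sketch also shows why the approach is cumbersome: the name returned by setlocale must be copied before the next setlocale call, and the whole sequence is unsafe if another thread depends on the locale at the same time.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Parses TEXT as a floating-point number using the "C" locale, then
   restores the previous LC_NUMERIC locale.  If the previous locale
   name cannot be saved, LC_NUMERIC is left set to "C".
   Not multithread-safe.  Hypothetical helper, for illustration only.  */
double
strtod_in_c_locale (const char *text)
{
  char *saved = NULL;
  const char *current = setlocale (LC_NUMERIC, NULL);

  /* The returned string may be overwritten by later setlocale calls,
     so keep a private copy.  */
  if (current != NULL)
    {
      saved = malloc (strlen (current) + 1);
      if (saved != NULL)
        strcpy (saved, current);
    }

  setlocale (LC_NUMERIC, "C");
  double value = strtod (text, NULL);

  if (saved != NULL)
    {
      setlocale (LC_NUMERIC, saved);
      free (saved);
    }
  return value;
}
```

Every place in the program that reads or writes numbers in a fixed format would need a wrapper of this kind, which is why the substitute functions mentioned above are usually the better choice.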


4.3 Preparing Translatable Strings

Before strings can be marked for translations, they sometimes need to be adjusted. Usually preparing a string for translation is done right before marking it, during the marking phase which is described in the next sections. What you have to keep in mind while doing that is the following.

Let's look at some examples of these guidelines.

Translatable strings should be in good English style. If slang language with abbreviations and shortcuts is used, often translators will not understand the message and will produce very inappropriate translations.

     "%s: is parameter\n"

This is nearly untranslatable: Is the displayed item a parameter or the parameter?

     "No match"

The ambiguity in this message makes it unintelligible: Is the program attempting to set something on fire? Does it mean "The given object does not match the template"? Does it mean "The template does not fit for any of the objects"?

In both cases, adding more words to the message will help both the translator and the English speaking user.

Translatable strings should be entire sentences. It is often not possible t