gettext utilities
This manual documents the GNU gettext tools and the GNU libintl library, version 0.17.
--- The Detailed Node Listing ---
Introduction
The User's View
Setting the POSIX Locale
Preparing Program Sources
Making the PO Template File
Creating a New PO File
Updating Existing PO Files
Editing PO Files
Emacs's PO File Editor
Using Translation Compendia
Manipulating PO Files
Highlighting parts of PO files
Producing Binary MO Files
The Programmer's View
About catgets
About gettext
Temporary Notes for the Programmers Chapter
The Translator's View
Organization
National Teams
The Maintainer's View
Files You Must Create or Alter
Autoconf macros for use in configure.ac
Integrating with CVS
Other Programming Languages
The Translator's View
Individual Programming Languages
sh - Shell Script
Perl
Internationalizable Data
Concluding Remarks
Language Codes
Licenses
This chapter explains the goals sought in the creation
of GNU gettext and the free Translation Project.
Then, it explains a few broad concepts around
Native Language Support, and positions message translation with regard
to other aspects of national and cultural variance, as they apply
to programs. It also surveys those files used to convey the
translations. It explains how the various tools interact in the
initial generation of these files, and later, how the maintenance
cycle should usually operate.
In this manual, we use he when speaking of the programmer or
maintainer, she when speaking of the translator, and they
when speaking of the installers or end users of the translated program.
This is only a convenience for clarifying the documentation. It is
absolutely not meant to imply that some roles are more appropriate
to males or females. Besides, as you might guess, GNU gettext
is meant to be useful for people using computers, whatever their sex,
race, religion or nationality!
Please send suggestions and corrections to:
Internet address:
bug-gnu-gettext@gnu.org
Please include the manual's edition number and update date in your messages.
Usually, programs are written and documented in English, and use English at execution time to interact with users. This is true not only of GNU software, but also of a great deal of proprietary and free software. Using a common language is quite handy for communication between developers, maintainers and users from all countries. On the other hand, most people are less comfortable with English than with their own native language, and would prefer to use their mother tongue for day-to-day work, as far as possible. Many would simply love to see their computer screen showing a lot less English, and far more of their own language.
However, to many people, this dream might appear so far-fetched that they may believe it is not even worth spending time thinking about it. They have no confidence at all that the dream might ever come true. Yet some have not lost hope, and have organized themselves. The Translation Project is a formalization of this hope into a workable structure, which has a good chance of getting all of us nearer the achievement of a truly multi-lingual set of programs.
GNU gettext is an important step for the Translation Project,
as it is an asset on which we may build many other steps. This package
offers to programmers, translators and even users, a well integrated
set of tools and documentation. Specifically, the GNU gettext
utilities are a set of tools that provides a framework within which
other free packages may produce multi-lingual messages.
GNU gettext is designed to minimize the impact of
internationalization on program sources, keeping this impact as small
and hardly noticeable as possible. Internationalization has better
chances of succeeding if it is very lightweight, or at least
appears to be so, when looking at program sources.
The Translation Project also uses the GNU gettext distribution
as a vehicle for documenting its structure and methods. This goes
beyond the strict technicalities of documenting the GNU gettext
proper. By so doing, translators will find in a single place, as
far as possible, all they need to know for properly doing their
translating work. This supplemental documentation might also
help programmers, and even curious users, in understanding how GNU
gettext is related to the remainder of the Translation
Project, and consequently, have a glimpse at the big picture.
Two long words appear all the time when we discuss support of native language in programs, and these words have a precise meaning, worth explaining here, once and for all in this document. The words are internationalization and localization. Many people, tired of writing these long words over and over again, took the habit of writing i18n and l10n instead, quoting the first and last letter of each word, and replacing the run of intermediate letters by a number merely telling how many such letters there are. But in this manual, for the sake of clarity, we will patiently write the names in full, each time...
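The abbreviation rule just described is mechanical enough to sketch in C. The function name numeronym and its buffer-based interface are assumptions made for this illustration; nothing of the kind is part of GNU gettext.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the abbreviated form of WORD into OUT.
   The first and last letters are kept, and the run of letters in
   between is replaced by its length.  Words shorter than three
   letters are copied unchanged.
   Returns 0 on success, -1 if OUT is too small. */
int numeronym(const char *word, char *out, size_t out_size)
{
    size_t len = strlen(word);
    int written;

    if (len < 3)
        written = snprintf(out, out_size, "%s", word);
    else
        written = snprintf(out, out_size, "%c%zu%c",
                           word[0], len - 2, word[len - 1]);

    return (written < 0 || (size_t)written >= out_size) ? -1 : 0;
}
```

Applied to internationalization (20 letters) this gives i18n, and applied to localization (12 letters) it gives l10n.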
By internationalization, one refers to the operation by which a
program, or a set of programs turned into a package, is made aware of and
able to support multiple languages. This is a generalization process,
by which the programs are untied from calling only English strings or
other English specific habits, and connected to generic ways of doing
the same, instead. Program developers may use various techniques to
internationalize their programs. Some of these have been standardized.
GNU gettext offers one of these standards. See Programmers.
By localization, one means the operation by which, in a set of programs already internationalized, one gives the program all needed information so that it can adapt itself to handle its input and output in a fashion which is correct for some native language and cultural habits. This is a particularization process, by which generic methods already implemented in an internationalized program are used in specific ways. The programming environment puts several functions at the programmer's disposal which allow this runtime configuration. The formal description of a specific set of cultural habits for some country, together with all associated translations targeted to the same native language, is called the locale for this language or country. Users achieve localization of programs by setting proper values to special environment variables, prior to executing those programs, identifying which locale should be used.
In fact, locale message support is only one component of the cultural data that makes up a particular locale. There are a whole host of routines and functions provided to aid programmers in developing internationalized software and which allow them to access the data stored in a particular locale. When someone presently refers to a particular locale, they are obviously referring to the data stored within that particular locale. Similarly, if a programmer is referring to “accessing the locale routines”, they are referring to the complete suite of routines that access all of the locale's information.
One uses the expression Native Language Support, or merely NLS, for speaking of the overall activity or feature encompassing both internationalization and localization, allowing for multi-lingual interactions in a program. In a nutshell, one could say that internationalization is the operation by which further localizations are made possible.
Also, very roughly said, when it comes to multi-lingual messages, internationalization is usually taken care of by programmers, and localization is usually taken care of by translators.
For a totally multi-lingual distribution, there are many things to translate beyond output messages.
gettext offers a complete toolset for
translating messages output by C programs. Perl scripts and shell
scripts will also need to be translated. Even if there are today some hooks
by which this can be done, these hooks are not integrated as well as they
should be.
Some programs, like autoconf or bison, are able
to produce other programs (or scripts). Even if the generating
programs themselves are internationalized, the generated programs they
produce may need internationalization on their own, and this indirect
internationalization could be automated right from the generating
program. In fact, quite usually, generating and generated programs
could be internationalized independently, as the effort needed is
fairly orthogonal.
recode program is able to reconstruct at execution.
Since these descriptions are extracted from the RFC by mechanical means,
translating them properly would require a prior translation of the RFC
itself.
One might want gcc to allow diacriticized characters in identifiers or use
translated keywords; ‘rm -i’ might accept something else than
‘y’ or ‘n’ for replies, etc. Even if the program will
eventually make most of its output in the foreign languages, one has
to decide whether the input syntax, option values, etc., are to be
localized or not.
As we already stressed, translation is only one aspect of locales.
Other internationalization aspects are system services and are handled
in GNU libc. There
are many attributes that are needed to define a country's cultural
conventions. These attributes include, besides the country's native
language, the formatting of the date and time, the representation of
numbers, the symbols for currency, etc. These local rules are
termed the country's locale. The locale represents the knowledge
needed to support the country's native attributes.
There are a few major areas which may vary between countries and
hence, define what a locale must describe. The following list helps
put multi-lingual messages into the proper context of other tasks
related to locales. See the GNU libc manual for details.
Time of the day may be noted as hh:mm, hh.mm,
or otherwise. Some locales require time to be specified in 24-hour
mode rather than as AM or PM. Further, the nature and yearly extent
of the Daylight Saving correction vary widely between countries.
12,345.67       English
12.345,67       German
 12345,67       French
1,2345.67       Asia
Some programs could go further and use different unit systems, like
English units or Metric units, or even take into account variants
about how numbers are spelled in full.
gettext provides the means for developers and users to
easily change the language that the software uses to communicate to
the user.
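The number formats listed above differ only in the grouping separator, the group size and the decimal mark. The following C sketch makes those three parameters explicit. The function format_grouped and its signature are assumptions made for illustration; a real program would normally let the C library and the current locale take care of this rather than hard-code separators.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format an amount given in hundredths, for
   example 1234567 for 12345.67, into OUT.
   GROUP_SIZE is the number of digits per group (0 disables grouping),
   GROUP_SEP is the string placed between groups (it may be empty),
   DECIMAL_POINT is the character placed before the two fraction digits.
   Returns 0 on success, -1 if OUT is too small. */
int format_grouped(unsigned long hundredths, int group_size,
                   const char *group_sep, char decimal_point,
                   char *out, size_t out_size)
{
    char digits[32];
    int n = snprintf(digits, sizeof digits, "%lu", hundredths / 100);
    size_t sep_len = strlen(group_sep);
    size_t pos = 0;

    for (int i = 0; i < n; i++) {
        int remaining = n - i;  /* integer digits still to be emitted */

        /* Insert a separator whenever a whole number of groups remains. */
        if (i > 0 && group_size > 0 && remaining % group_size == 0) {
            if (pos + sep_len >= out_size)
                return -1;
            memcpy(out + pos, group_sep, sep_len);
            pos += sep_len;
        }
        if (pos + 1 >= out_size)
            return -1;
        out[pos++] = digits[i];
    }

    /* Append the decimal mark and exactly two fraction digits. */
    int w = snprintf(out + pos, out_size - pos, "%c%02lu",
                     decimal_point, hundredths % 100);
    if (w < 0 || (size_t)w >= out_size - pos)
        return -1;
    return 0;
}
```

With the amount 1234567, a group size of 3 with "," and '.' gives the English form 12,345.67, a group size of 3 with "." and ',' gives the German form 12.345,67, and a group size of 4 with "," and '.' gives 1,2345.67.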
These areas of cultural conventions are called locale categories. It is an unfortunate term; locale aspects or locale feature categories would be a better term, because each “locale category” describes an area or task that requires localization. The concrete data that describes the cultural conventions for such an area and for a particular culture is also called a locale category. In this sense, a locale is composed of several locale categories: the locale category describing the codeset, the locale category describing the formatting of numbers, the locale category containing the translated messages, and so on.
Components of locale outside of message handling are standardized in
the ISO C standard and the POSIX:2001 standard (also known as the SUSv3
specification). GNU libc
fully implements this, and most other modern systems provide more
or less reasonable support for at least some of the missing components.
The letters PO in .po files mean Portable Object, to distinguish them from .mo files, where MO stands for Machine Object. This paradigm, as well as the PO file format, is inspired by the NLS standard developed by Uniforum, and first implemented by Sun in their Solaris system.
PO files are meant to be read and edited by humans, and associate each
original, translatable string of a given package with its translation
in a particular target language. A single PO file is dedicated to
a single target language. If a package supports many languages,
there is one such PO file per language supported, and each package
has its own set of PO files. These PO files are best created by
the xgettext program, and later updated or refreshed through
the msgmerge program. Program xgettext extracts all
marked messages from a set of C files and initializes a PO file with
empty translations. Program msgmerge takes care of adjusting
PO files between releases of the corresponding sources, commenting
obsolete entries, initializing new ones, and updating all source
line references. Files ending with .pot are a kind of base
translation file found in distributions, in PO file format.
MO files are meant to be read by programs, and are binary in nature.
A few systems already offer tools for creating and handling MO files
as part of the Native Language Support coming with the system, but the
format of these MO files is often different from system to system,
and non-portable. The tools already provided with these systems don't
support all the features of GNU gettext. Therefore GNU
gettext uses its own format for MO files. Files ending with
.gmo are really MO files, when it is known that these files use
the GNU format.
The following diagram summarizes the relation between the files
handled by GNU gettext and the tools acting on these files.
It is followed by somewhat detailed explanations, which you should
read while keeping an eye on the diagram. Having a clear understanding
of these interrelations will surely help programmers, translators
and maintainers.
Original C Sources ───> Preparation ───> Marked C Sources ───╮
                                                             │
              ╭─────────<─── GNU gettext Library             │
╭─── make <───┤                                              │
│             ╰─────────<────────────────────┬───────────────╯
│                                            │
│   ╭─────<─── PACKAGE.pot <─── xgettext <───╯   ╭───<─── PO Compendium
│   │                                            │              ↑
│   │                                            ╰───╮          │
│   ╰───╮                                            ├───> PO editor ───╮
│       ├────> msgmerge ──────> LANG.po ────>────────╯                  │
│   ╭───╯                                                               │
│   │                                                                   │
│   ╰─────────────<───────────────╮                                     │
│                                 ├─── New LANG.po <────────────────────╯
│   ╭─── LANG.gmo <─── msgfmt <───╯
│   │
│   ╰───> install ───> /.../LANG/PACKAGE.mo ───╮
│                                              ├───> "Hello world!"
╰───────> install ───> /.../bin/PROGRAM ───────╯
As a programmer, the first step to bringing GNU gettext
into your package is identifying, right in the C sources, those strings
which are meant to be translatable, and those which are untranslatable.
This tedious job can be done a little more comfortably using Emacs PO
mode, but you can use any means familiar to you for modifying your
C sources. Besides this, some other simple, standard changes are needed to
properly initialize the translation library. See Sources, for
more information about all this.
For newly written software the strings of course can and should be
marked while writing it. The gettext approach makes this
very easy. Simply put the following lines at the beginning of each file
or in a central header file:
#define _(String) (String)
#define N_(String) String
#define textdomain(Domain)
#define bindtextdomain(Package, Directory)
Doing this allows you to prepare the sources for internationalization.
Later when you feel ready for the step to use the gettext library
simply replace these definitions by the following:
#include <libintl.h>
#define _(String) gettext (String)
#define gettext_noop(String) String
#define N_(String) gettext_noop (String)
and link against libintl.a or libintl.so. Note that on
GNU systems, you don't need to link with libintl because the
gettext library functions are already contained in GNU libc.
That is all you have to change.
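As a minimal sketch of how these definitions might be used once the library is in place, the following C fragment initializes the locale, binds a text domain and looks up two strings. The domain name hello-sketch, the directory /usr/share/locale, and the helper names init_i18n, greeting and weekday are assumptions made for this illustration; they do not come from this manual. The calls setlocale, bindtextdomain, textdomain and gettext are the standard ones declared in <locale.h> and <libintl.h>.

```c
#include <libintl.h>
#include <locale.h>

#define _(String) gettext (String)
#define gettext_noop(String) String
#define N_(String) gettext_noop (String)

/* Hypothetical package name and catalog directory. */
#define PACKAGE   "hello-sketch"
#define LOCALEDIR "/usr/share/locale"

/* N_ only marks these strings for extraction.  They cannot be
   translated here, because a static initializer must be constant. */
static const char *const weekday_names[] = {
    N_("Monday"),
    N_("Tuesday"),
};

/* Call once at program start, before any message is looked up. */
void init_i18n(void)
{
    setlocale(LC_ALL, "");              /* take the locale from the environment */
    bindtextdomain(PACKAGE, LOCALEDIR); /* where the MO files are installed */
    textdomain(PACKAGE);                /* default domain for gettext() */
}

/* A string translated at the point of use. */
const char *greeting(void)
{
    return _("Hello world!");
}

/* A string marked with N_ above and translated at run time. */
const char *weekday(int index)
{
    return gettext(weekday_names[index]);
}
```

When no catalog for the domain is installed, gettext returns its argument unchanged, so greeting() yields the original English text. As noted above, linking needs libintl on systems where the functions are not part of the C library.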
Once the C sources have been modified, the xgettext program
is used to find and extract all translatable strings, and create a
PO template file out of all these. This package.pot file
contains all original program strings. It has sets of pointers to
exactly where in C sources each string is used. All translations
are set to empty. The letter t in .pot marks this as
a Template PO file, not yet oriented towards any particular language.
See xgettext Invocation, for more details about how one calls the
xgettext program. If you are really lazy, you might
be interested in working a lot more right away, and preparing the
whole distribution setup (see Maintainers). By doing so, you
spare yourself typing the xgettext command, as make
should now generate the proper things automatically for you!
The first time through, there is no lang.po yet, so the
msgmerge step may be skipped and replaced by a mere copy of
package.pot to lang.po, where lang
represents the target language. See Creating for details.
Then comes the initial translation of messages. Translation in itself is a whole matter, still exclusively meant for humans, and whose complexity far overwhelms the level of this manual. Nevertheless, a few hints are given in some other chapter of this manual (see Translators). You will also find there indications about how to contact translating teams, or becoming part of them, for sharing your translating concerns with others who target the same native language.
While adding the translated messages into the lang.po PO file, if you are not using one of the dedicated PO file editors (see Editing), you are on your own for ensuring that your efforts fully respect the PO file format and quoting conventions (see PO Files). This is surely not an impossible task, as this is the way many people handled PO files around 1995. On the other hand, by using a PO file editor, most details of the PO file format are taken care of for you, but you have to acquire some familiarity with the PO file editor itself.
If some common translations have already been saved into a compendium PO file, translators may use PO mode for initializing untranslated entries from the compendium, and also save selected translations into the compendium, updating it (see Compendium). Compendium files are meant to be exchanged between members of a given translation team.
Programs, or packages of programs, are dynamic in nature: users write bug reports and suggestions for improvement, and maintainers react by modifying programs in various ways. The fact that a package has already been internationalized should not make maintainers shy of adding new strings, or modifying strings already translated. They just do their job the best they can. For the Translation Project to work smoothly, it is important that maintainers do not carry translation concerns on their already loaded shoulders, and that translators be kept as free as possible of programming concerns.
The only concern maintainers should have is carefully marking new
strings as translatable, when they should be, and not otherwise
worrying about them being translated, as this will come in proper time.
Consequently, when programs and their strings are adjusted in various
ways by maintainers, and for matters usually unrelated to translation,
xgettext would construct package.pot files which are
evolving over time, so the translations carried by lang.po
are slowly fading out of date.
It is important for translators (and even maintainers) to understand that package translation is a continuous process in the lifetime of a package, and not something which is done once and for all at the start. After an initial burst of translation activity for a given package, interventions are needed once in a while, because here and there, translated entries become obsolete, and new untranslated entries appear, needing translation.
The msgmerge program has the purpose of refreshing an already
existing lang.po file, by comparing it with a newer
package.pot template file, extracted by xgettext
out of recent C sources. The refreshing operation adjusts all
references to C source locations for strings, since these strings
move as programs are modified. Also, msgmerge comments out as
obsolete, in lang.po, those already translated entries
which are no longer used in the program sources (see Obsolete Entries). It finally discovers new strings and inserts them in
the resulting PO file as untranslated entries (see Untranslated Entries). See msgmerge Invocation, for more information about what
msgmerge really does.
Whatever route or means taken, the goal is to obtain an updated lang.po file offering translations for all strings.
The temporal mobility, or fluidity of PO files, is an integral part of the translation game, and should be well understood, and accepted. People resisting it will have a hard time participating in the Translation Project, or will give a hard time to other participants! In particular, maintainers should relax and include all available official PO files in their distributions, even if these have not recently been updated, without exerting pressure on the translator teams to get the job done. The pressure should rather come from the community of users speaking a particular language, and maintainers should consider themselves fairly relieved of any concern about the adequacy of translation files. On the other hand, translators should reasonably try updating the PO files they are responsible for, while the package is undergoing pretest, prior to an official distribution.
Once the PO file is complete and dependable, the msgfmt program
is used for turning the PO file into a machine-oriented format, which
may yield efficient retrieval of translations by the programs of the
package, whenever needed at runtime (see MO Files). See msgfmt Invocation, for more information about all modes of execution
for the msgfmt program.
Finally, the modified and marked C sources are compiled and linked
with the GNU gettext library, usually through the operation of
make, given a suitable Makefile exists for the project,
and the resulting executable is installed somewhere users will find it.
The MO files themselves should also be properly installed. Given the
appropriate environment variables are set (see Setting the POSIX Locale),
the program should localize itself automatically, whenever it executes.
The remainder of this manual has the purpose of explaining in depth the various steps outlined above.
Nowadays, when users log into a computer, they usually find that all their programs show messages in their native language – at least for users of languages with an active free software community, like French or German; to a lesser extent for languages with a smaller participation in free software and the GNU project, like Hindi and Filipino.
How does this work? How can the user influence the language that is used by the programs? This chapter answers these questions.
The default language is often already specified during operating system installation. When the operating system is installed, the installer typically asks for the language used for the installation process and, separately, for the language to use in the installed system. Some OS installers only ask for the language once.
This determines the system-wide default language for all users. But the installers often give the possibility to install extra localizations for additional languages. For example, the localizations of KDE (the K Desktop Environment) and OpenOffice.org are often bundled separately, as one installable package per language.
At this point it is good to consider the intended use of the machine: If it is a machine designated for personal use, additional localizations are probably not necessary. If, however, the machine is in use in an organization or company that has international relationships, one can consider the needs of guest users. If you have a guest from abroad, for a week, what could be his preferred locales? It may be worth installing these additional localizations ahead of time, since they cost only a bit of disk space at this point.
The system-wide default language is the locale configuration that is used when a new user account is created. But the user can have his own locale configuration that is different from the one of the other users of the same machine. He can specify it, typically after the first login, as described in the next section.
The immediately available programs in a user's desktop come from a group of programs called a “desktop environment”; it usually includes the window manager, a web browser, a text editor, and more. The most common free desktop environments are KDE, GNOME, and Xfce.
The locale used by GUI programs of the desktop environment can be specified in a configuration screen called “control center”, “language settings” or “country settings”.
Individual GUI programs that are not part of the desktop environment can have their locale specified either in a settings panel, or through environment variables.
For some programs, it is possible to specify the locale through environment
variables, possibly even to a different locale than the desktop's locale.
This means, instead of starting a program through a menu or from the file
system, you can start it from the command-line, after having set some
environment variables. The environment variables can be those specified
in the next section (Setting the POSIX Locale); for some versions of
KDE, however, the locale is specified through a variable KDE_LANG,
rather than LANG or LC_ALL.
As a user, if your language has been installed for this package, in the
simplest case, you only have to set the LANG environment variable
to the appropriate ‘ll_CC’ combination. For example,
let's suppose that you speak German and live in Germany. At the shell
prompt, merely execute
‘setenv LANG de_DE’ (in csh),
‘export LANG; LANG=de_DE’ (in sh) or
‘export LANG=de_DE’ (in bash). This can be done from your
.login or .profile file, once and for all.
A locale name usually has the form ‘ll_CC’. Here
‘ll’ is an ISO 639 two-letter language code, and
‘CC’ is an ISO 3166 two-letter country code. For example,
for German in Germany, ll is de, and CC is DE.
You find a list of the language codes in appendix Language Codes and
a list of the country codes in appendix Country Codes.
You might think that the country code specification is redundant. But in fact, some languages have dialects in different countries. For example, ‘de_AT’ is used for Austria, and ‘pt_BR’ for Brazil. The country code serves to distinguish the dialects.
Many locale names have an extended syntax ‘ll_CC.encoding’ that also specifies the character encoding. These are in use because between 2000 and 2005, most users switched to locales in UTF-8 encoding. For example, the German locale on glibc systems is nowadays ‘de_DE.UTF-8’. The older name ‘de_DE’ still refers to the German locale as of 2000, which stores characters in ISO-8859-1 encoding, a text encoding that cannot even accommodate the Euro currency sign.
Some locale names use ‘ll_CC@variant’ instead of ‘ll_CC’. The ‘@variant’ can denote any kind of characteristic that is not already implied by the language ll and the country CC. It can denote a particular monetary unit. For example, on glibc systems, ‘de_DE@euro’ denotes the locale that uses the Euro currency, in contrast to the older locale ‘de_DE’ which implies the use of the currency before 2002. It can also denote a dialect of the language, or the script used to write text (for example, ‘sr_RS@latin’ uses the Latin script, whereas ‘sr_RS’ uses the Cyrillic script to write Serbian), or the orthography rules, or similar.
On other systems, some variations of this scheme are used, such as ‘ll’. You can get the list of locales supported by your system for your language by running the command ‘locale -a | grep '^ll'’.
There is also a special locale, called ‘C’. When it is used, it disables all localization: in this locale, all programs standardized by POSIX use English messages and an unspecified character encoding (often US-ASCII, but sometimes also ISO-8859-1 or UTF-8, depending on the operating system).
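The naming scheme described in this section can be taken apart mechanically. The following C sketch splits a name of the form ‘ll_CC.encoding@variant’ into its parts, leaving any part that is absent empty. The type locale_parts and the function parse_locale_name are hypothetical names introduced for this illustration; the sketch does not claim to reflect how any C library parses locale names.

```c
#include <string.h>

/* Hypothetical container for the pieces of a locale name. */
struct locale_parts {
    char language[16];   /* ll, or a special name such as C */
    char country[16];    /* CC, empty if absent */
    char encoding[32];   /* e.g. UTF-8, empty if absent */
    char variant[32];    /* e.g. euro or latin, empty if absent */
};

/* Copy the characters in [begin, end) into DST, truncating if needed. */
static void copy_span(char *dst, size_t dst_size,
                      const char *begin, const char *end)
{
    size_t len = (size_t)(end - begin);
    if (len >= dst_size)
        len = dst_size - 1;
    memcpy(dst, begin, len);
    dst[len] = '\0';
}

void parse_locale_name(const char *name, struct locale_parts *parts)
{
    const char *end = name + strlen(name);

    /* The variant starts at the '@', if there is one. */
    const char *at = strchr(name, '@');
    const char *variant_mark = at ? at : end;

    /* The encoding starts at a '.' located before the variant. */
    const char *dot = memchr(name, '.', (size_t)(variant_mark - name));
    const char *encoding_mark = dot ? dot : variant_mark;

    /* The country starts at a '_' located before the encoding. */
    const char *underscore = memchr(name, '_', (size_t)(encoding_mark - name));
    const char *country_mark = underscore ? underscore : encoding_mark;

    copy_span(parts->language, sizeof parts->language, name, country_mark);
    copy_span(parts->country, sizeof parts->country,
              underscore ? underscore + 1 : encoding_mark, encoding_mark);
    copy_span(parts->encoding, sizeof parts->encoding,
              dot ? dot + 1 : variant_mark, variant_mark);
    copy_span(parts->variant, sizeof parts->variant,
              at ? at + 1 : end, end);
}
```

For ‘de_DE.UTF-8’ this yields the language de, the country DE and the encoding UTF-8; for ‘sr_RS@latin’ it yields sr, RS and the variant latin; for ‘C’ only the language field is filled in.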
A locale is composed of several locale categories, see Aspects. When a program looks up locale dependent values, it does this according to the following environment variables, in priority order:
1. LANGUAGE
2. LC_ALL
3. LC_xxx, according to selected locale category:
   LC_CTYPE, LC_NUMERIC, LC_TIME, LC_COLLATE,
   LC_MONETARY, LC_MESSAGES, ...
4. LANG
Variables whose value is set but is empty are ignored in this lookup.
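The lookup order above, including the rule that empty values are skipped, can be sketched in C as follows. The function names first_nonempty and messages_locale_from_env are assumptions made for this illustration, and LC_MESSAGES is used as the example category. The sketch only mirrors the list as stated here; it is not the code used by any C library, and it leaves out the special handling of LANGUAGE described later in this chapter.

```c
#include <stddef.h>
#include <stdlib.h>

/* Return the first entry of VALUES that is neither NULL nor empty,
   or NULL if there is none. */
const char *first_nonempty(const char *const *values, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (values[i] != NULL && values[i][0] != '\0')
            return values[i];
    }
    return NULL;
}

/* Consult the environment in the priority order listed above,
   for the message category. */
const char *messages_locale_from_env(void)
{
    const char *const candidates[] = {
        getenv("LANGUAGE"),
        getenv("LC_ALL"),
        getenv("LC_MESSAGES"),
        getenv("LANG"),
    };
    return first_nonempty(candidates,
                          sizeof candidates / sizeof candidates[0]);
}
```

Keeping the ordering logic in first_nonempty, separate from the getenv calls, makes it possible to check the priority rules without modifying the process environment.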
LANG is the normal environment variable for specifying a locale.
As a user, you normally set this variable (unless some of the other variables
have already been set by the system, in /etc/profile or similar
initialization files).
LC_CTYPE, LC_NUMERIC, LC_TIME, LC_COLLATE,
LC_MONETARY, LC_MESSAGES, and so on, are the environment
variables meant to override LANG and affecting a single locale
category only. For example, assume you are a Swedish user in Spain, and you
want your programs to handle numbers and dates according to Spanish
conventions, and only the messages should be in Swedish. Then you could
create a locale named ‘sv_ES’ or ‘sv_ES.UTF-8’ by use of the
localedef program. But it is simpler, and achieves the same effect,
to set the LANG variable to es_ES.UTF-8 and the
LC_MESSAGES variable to sv_SE.UTF-8; these two locales come
already preinstalled with the operating system.
LC_ALL is an environment variable that overrides all of these.
It is typically used in scripts that run particular programs. For example,
configure scripts generated by GNU autoconf use LC_ALL to make
sure that the configuration tests don't operate in locale dependent ways.
Some systems, unfortunately, set LC_ALL in /etc/profile or in
similar initialization files. As a user, you therefore have to unset this
variable if you want to set LANG and optionally some of the other
LC_xxx variables.
The LANGUAGE variable is described in the next subsection.
Not all programs have translations for all languages. By default, an
English message is shown in place of a nonexistent translation. If you
understand other languages, you can set up a priority list of languages.
This is done through a different environment variable, called
LANGUAGE. GNU gettext gives preference to LANGUAGE
over LC_ALL and LANG for the purpose of message handling,
but you still need to have LANG (or LC_ALL) set to the primary
language; this is required by other parts of the system libraries.
For example, some Swedish users who would rather read translations in
German than English when Swedish is not available, set LANGUAGE
to ‘sv:de’ while leaving LANG set to ‘sv_SE’.
Special advice for Norwegian users: The language code for Norwegian
bokmål changed from ‘no’ to ‘nb’ recently (in 2003).
During the transition period, while some message catalogs for this language
are installed under ‘nb’ and some older ones under ‘no’, it is
recommended for Norwegian users to set LANGUAGE to ‘nb:no’ so that
both newer and older translations are used.
In the LANGUAGE environment variable, but not in the other
environment variables, ‘ll_CC’ combinations can be
abbreviated as ‘ll’ to denote the language's main dialect.
For example, ‘de’ is equivalent to ‘de_DE’ (German as spoken in
Germany), and ‘pt’ to ‘pt_PT’ (Portuguese as spoken in Portugal)
in this context.
Note: The variable LANGUAGE is ignored if the locale is set to
‘C’. In other words, you have to first enable localization, by setting
LANG (or LC_ALL) to a value other than ‘C’, before you can
use a language priority list through the LANGUAGE variable.
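A value such as ‘sv:de’ or ‘nb:no’ is simply a colon-separated list, read from the most preferred language to the least preferred one. The following C sketch splits such a value into its entries. The function split_language_list and the constant MAX_LANG_LEN are hypothetical names introduced for this illustration; skipping empty and overlong entries is a choice made for the sketch, not documented behaviour of GNU gettext.

```c
#include <stddef.h>
#include <string.h>

#define MAX_LANG_LEN 32  /* hypothetical size limit for one entry */

/* Split a LANGUAGE-style value into its entries, in priority order.
   Empty entries and entries that do not fit are skipped.
   Returns the number of entries stored in OUT. */
size_t split_language_list(const char *list,
                           char out[][MAX_LANG_LEN], size_t max_entries)
{
    size_t count = 0;

    while (*list != '\0' && count < max_entries) {
        size_t len = strcspn(list, ":");  /* length up to the next ':' */

        if (len > 0 && len < MAX_LANG_LEN) {
            memcpy(out[count], list, len);
            out[count][len] = '\0';
            count++;
        }
        list += len;
        if (*list == ':')
            list++;                       /* step over the separator */
    }
    return count;
}
```

For the value ‘sv:de’ the result is the two entries sv and de, in that order.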
Languages are not equally well supported in all packages using GNU
gettext, and more translations are added over time. Usually, you
use the translations that are shipped with the operating system
or with particular packages that you install afterwards. But you can also
install newer localizations directly. For doing this, you will need an
understanding of where each localization file is stored on the file system.
For programs that participate in the Translation Project, you can start looking for translations here: http://translationproject.org/team/index.html. A snapshot of this information is also found in the ABOUT-NLS file that is shipped with GNU gettext.
For programs that are part of the KDE project, the starting point is: http://i18n.kde.org/.
For programs that are part of the GNOME project, the starting point is: http://www.gnome.org/i18n/.
For other programs, you may check whether the program's source code package contains some ll.po files; often they are kept together in a directory called po/. Each ll.po file contains the message translations for the language whose abbreviation is ll.
The GNU gettext toolset helps programmers and translators
produce, update and use translation files, mainly those
PO files which are textual, editable files. This chapter explains
the format of PO files.
A PO file is made up of many entries, each entry holding the relation between an original untranslated string and its corresponding translation. All entries in a given PO file usually pertain to a single project, and all translations are expressed in a single target language. One PO file entry has the following schematic structure:
white-space
# translator-comments
#. extracted-comments
#: reference...
#, flag...
#| msgid previous-untranslated-string
msgid untranslated-string
msgstr translated-string
The general structure of a PO file should be well understood by the translator. When using PO mode, very little has to be known about the format details, as PO mode takes care of them for her.
A simple entry can look like this:
#: lib/error.c:116
msgid "Unknown system error"
msgstr "Error desconegut del sistema"
Entries begin with some optional white space. Usually, when generated
through GNU gettext tools, there is exactly one blank line
between entries. Then comments follow, on lines all starting with the
character #. There are two kinds of comments: those which have
some white space immediately following the # - the translator
comments -, which comments are created and maintained exclusively by the
translator, and those which have some non-white character just after the
# - the automatic comments -, which comments are created and
maintained automatically by GNU gettext tools. Comment lines
starting with #. contain comments given by the programmer, directed
at the translator; these comments are called extracted comments
because the xgettext program extracts them from the program's
source code. Comment lines starting with #: contain references to
the program's source code. Comment lines starting with #, contain
flags; more about these below. Comment lines starting with #|
contain the previous untranslated string for which the translator gave
a translation.
All comments, of either kind, are optional.
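The comment kinds described above are distinguished by the character that follows the #. A minimal sketch of that classification in C follows; the enum and the function classify_po_line are hypothetical names chosen for this illustration and are not part of the GNU gettext tools. Treating a bare # line as a translator comment is an assumption of the sketch.

```c
#include <stddef.h>

enum po_line_kind
{
  PO_NOT_COMMENT,         /* line does not start with '#' */
  PO_TRANSLATOR_COMMENT,  /* "# ..."  written by the translator */
  PO_EXTRACTED_COMMENT,   /* "#. ..." extracted by xgettext */
  PO_REFERENCE,           /* "#: ..." source code reference */
  PO_FLAGS,               /* "#, ..." flag list */
  PO_PREVIOUS,            /* "#| ..." previous untranslated string */
  PO_OTHER_AUTOMATIC      /* '#' followed by another non-white character */
};

/* Classify one PO file line by looking at its first two characters.  */
enum po_line_kind
classify_po_line (const char *line)
{
  if (line == NULL || line[0] != '#')
    return PO_NOT_COMMENT;

  switch (line[1])
    {
    case '\0':
    case ' ':
    case '\t':
    case '\n':
      /* Assumption: a bare "#" counts as a translator comment.  */
      return PO_TRANSLATOR_COMMENT;
    case '.':
      return PO_EXTRACTED_COMMENT;
    case ':':
      return PO_REFERENCE;
    case ',':
      return PO_FLAGS;
    case '|':
      return PO_PREVIOUS;
    default:
      return PO_OTHER_AUTOMATIC;
    }
}
```

Applied to the simple entry shown earlier, the line ‘#: lib/error.c:116’ is classified as a reference, and the msgid line is not a comment at all.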
After white space and comments, entries show two strings, namely
first the untranslated string as it appears in the original program
sources, and then, the translation of this string. The original
string is introduced by the keyword msgid, and the translation,
by msgstr. The two strings, untranslated and translated,
are quoted in various ways in the PO file, using "
delimiters and \ escapes, but the translator does not really
have to pay attention to the precise quoting format, as PO mode fully
takes care of quoting for her.
The msgid strings, as well as automatic comments, are produced
and managed by other GNU gettext tools, and PO mode does not
provide means for the translator to alter these. The most she can
do is merely deleting them, and only by deleting the whole entry.
On the other hand, the msgstr string, as well as translator
comments, are really meant for the translator, and PO mode gives her
the full control she needs.
The comment lines beginning with #, are special because they are
not completely ignored by the programs as comments generally are. The
comma separated list of flags is used by the msgfmt
program to give the user some better diagnostic messages. Currently
there are two forms of flags defined:
fuzzy
This flag can be generated by the msgmerge program or it can be
inserted by the translator herself. It shows that the msgstr
string might not be a correct translation (anymore). Only the translator
can judge if the translation requires further modification, or is
acceptable as is. Once satisfied with the translation, she then removes
this fuzzy attribute. The msgmerge program inserts this
when it combined the msgid and msgstr entries after fuzzy
search only. See Fuzzy Entries.
c-format
no-c-format
These flags should not be added by a human. Instead only the
xgettext program adds them. In an automated PO file processing
system as proposed here the user changes would be thrown away again as
soon as the xgettext program generates a new template file.
The c-format flag tells that the untranslated string and the
translation are supposed to be C format strings. The no-c-format
flag tells that they are not C format strings, even though the untranslated
string happens to look like a C format string (with ‘%’ directives).
In case the c-format flag is given for a string, the msgfmt
program does some more tests to check the validity of the translation.
See msgfmt Invocation, c-format Flag and c-format.
Likewise, the following flag pairs apply to format strings of other programming languages and libraries:
objc-format, no-objc-format
sh-format, no-sh-format
python-format, no-python-format
lisp-format, no-lisp-format
elisp-format, no-elisp-format
librep-format, no-librep-format
scheme-format, no-scheme-format
smalltalk-format, no-smalltalk-format
java-format, no-java-format
csharp-format, no-csharp-format
awk-format, no-awk-format
object-pascal-format, no-object-pascal-format
ycp-format, no-ycp-format
tcl-format, no-tcl-format
perl-format, no-perl-format
perl-brace-format, no-perl-brace-format
php-format, no-php-format
gcc-internal-format, no-gcc-internal-format
qt-format, no-qt-format
kde-format, no-kde-format
boost-format, no-boost-format
It is also possible to have entries with a context specifier. They look like this:
white-space
# translator-comments
#. extracted-comments
#: reference...
#, flag...
#| msgctxt previous-context
#| msgid previous-untranslated-string
msgctxt context
msgid untranslated-string
msgstr translated-string
The context serves to disambiguate messages with the same
untranslated-string. It is possible to have several entries with
the same untranslated-string in a PO file, provided that they each
have a different context. Note that an empty context string
and an absent msgctxt line do not mean the same thing.
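One way to see why an empty context differs from an absent one is to look at how a lookup key can be formed from the two parts. The sketch below joins context and untranslated string with the EOT character (octal 004), which is the glue that the pgettext macros of GNU gettext's gettext.h use; the function make_lookup_key itself is a hypothetical helper written for this illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the lookup key for an entry into BUF.
   MSGCTXT == NULL stands for an entry without a msgctxt line.
   Returns the value of snprintf, i.e. the length the key needs.  */
int
make_lookup_key (char *buf, size_t bufsize,
                 const char *msgctxt, const char *msgid)
{
  if (msgctxt == NULL)
    /* No context: the key is the untranslated string itself.  */
    return snprintf (buf, bufsize, "%s", msgid);

  /* With a context, even an empty one, the EOT glue is part of the key.  */
  return snprintf (buf, bufsize, "%s\004%s", msgctxt, msgid);
}
```

For the untranslated string ‘Open’, an absent context gives the key ‘Open’, while an empty context gives a key that starts with the EOT character, so the two entries never collide.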
A different kind of entries is used for translations which involve plural forms.
white-space
# translator-comments
#. extracted-comments
#: reference...
#, flag...
#| msgid previous-untranslated-string-singular
#| msgid_plural previous-untranslated-string-plural
msgid untranslated-string-singular
msgid_plural untranslated-string-plural
msgstr[0] translated-string-case-0
...
msgstr[N] translated-string-case-n
Such an entry can look like this:
#: src/msgcmp.c:338 src/po-lex.c:699
#, c-format
msgid "found %d fatal error"
msgid_plural "found %d fatal errors"
msgstr[0] "s'ha trobat %d error fatal"
msgstr[1] "s'han trobat %d errors fatals"
Here also, a msgctxt context can be specified before msgid,
like above.
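On the C side, a plural entry such as the one above corresponds to a call to ngettext, declared in <libintl.h>. The following is a minimal sketch; the wrapper function format_error_count is hypothetical. When no message catalog is found, ngettext returns the singular string for a count of 1 and the plural string otherwise.

```c
#include <libintl.h>
#include <stdio.h>

/* Hypothetical wrapper: format the "fatal errors" message for COUNT
   into BUF, letting ngettext choose the appropriate plural form.  */
int
format_error_count (char *buf, size_t bufsize, int count)
{
  const char *format = ngettext ("found %d fatal error",
                                 "found %d fatal errors",
                                 (unsigned long int) count);
  return snprintf (buf, bufsize, format, count);
}
```

With a Catalan catalog installed and selected, the same call would pick msgstr[0] or msgstr[1] from the entry shown above.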
The previous-untranslated-string is optionally inserted by the
msgmerge program, at the same time when it marks a message fuzzy.
It helps the translator to see which changes were done by the developers
on the untranslated-string.
It happens that some lines, usually whitespace or comments, follow the very last entry of a PO file. Such lines are not part of any entry, and will be dropped when the PO file is processed by the tools, or may disturb some PO file editors.
The remainder of this section may be safely skipped by those using a PO file editor, yet it may be interesting for everybody to have a better idea of the precise format of a PO file. On the other hand, those wishing to modify PO files by hand should carefully continue reading.
Each of untranslated-string and translated-string respects the C syntax for a character string, including the surrounding quotes and embedded backslashed escape sequences. When the time comes to write multi-line strings, one should not use escaped newlines. Instead, a closing quote should follow the last character on the line to be continued, and an opening quote should resume the string at the beginning of the following PO file line. For example:
msgid ""
"Here is an example of how one might continue a very long string\n"
"for the common case the string represents multi-line output.\n"
In this example, the empty string is used on the first line, to
allow better alignment of the H from the word ‘Here’
over the f from the word ‘for’. In this example, the
msgid keyword is followed by three strings, which are meant
to be concatenated. Concatenating the empty string does not change
the resulting overall string, but it is a way for us to comply with
the necessity of msgid to be followed by a string on the same
line, while keeping the multi-line presentation left-justified, as
we find this to be a cleaner disposition. The empty string could have
been omitted, but only if the string starting with ‘Here’ was
promoted on the first line, right after msgid. It was not really necessary
either to switch between the two last quoted strings immediately after
the newline ‘\n’; the switch could have occurred after any
other character. We just did it this way because it is neater.
One should carefully distinguish between end of lines marked as ‘\n’ inside quotes, which are part of the represented string, and end of lines in the PO file itself, outside string quotes, which have no effect on the represented string.
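Since PO strings follow the C syntax, the same behavior can be observed with adjacent string literals in C itself. The sketch below, whose function names are hypothetical, writes the example string once in the continued form and once on a single line, so that the two can be compared.

```c
#include <stddef.h>

/* The continued form: three adjacent literals, the first one empty.  */
const char *
po_example_continued (void)
{
  return ""
         "Here is an example of how one might continue a very long string\n"
         "for the common case the string represents multi-line output.\n";
}

/* The same string written as a single literal.  */
const char *
po_example_single (void)
{
  return "Here is an example of how one might continue a very long string\nfor the common case the string represents multi-line output.\n";
}

/* Count the newline characters that are part of the string.  */
size_t
count_newlines (const char *s)
{
  size_t n = 0;
  for (; *s != '\0'; s++)
    if (*s == '\n')
      n++;
  return n;
}
```

Both functions return the same characters, and the string contains exactly two newlines, namely the two written as ‘\n’; the line breaks between the literals contribute nothing.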
Outside strings, white lines and comments may be used freely.
Comments start at the beginning of a line with ‘#’ and extend
until the end of the PO file line. Comments written by translators
should have the initial ‘#’ immediately followed by some white
space. If the ‘#’ is not immediately followed by white space,
this comment is most likely generated and managed by specialized GNU
tools, and might disappear or be replaced unexpectedly when the PO
file is given to msgmerge.
For the programmer, changes to the C source code fall into three
categories. First, you have to make the localization functions
known to all modules needing message translation. Second, you should
properly trigger the operation of GNU gettext when the program
initializes, usually from the main function. Last, you should
identify, adjust and mark all constant strings in your program
needing translation.
Presuming that your set of programs, or package, has been adjusted
so all needed GNU gettext files are available, and your
Makefile files are adjusted (see Maintainers), each C module
having translated C strings should contain the line:
#include <libintl.h>
Similarly, each C module containing printf()/fprintf()/...
calls with a format string that could be a translated C string (even if
the C string comes from a different C module) should contain the line:
#include <libintl.h>
The initialization of locale data should be done with more or less the same code in every program, as demonstrated below:
int
main (int argc, char *argv[])
{
  ...
  setlocale (LC_ALL, "");
  bindtextdomain (PACKAGE, LOCALEDIR);
  textdomain (PACKAGE);
  ...
}
PACKAGE and LOCALEDIR should be provided either by
config.h or by the Makefile. For now consult the gettext
or hello sources for more information.
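A compilable variant of this initialization might look as follows. It is a sketch only: the function name init_i18n is hypothetical, and the values ‘hello’ and ‘/usr/share/locale’ are placeholders standing in for whatever config.h or the Makefile would really provide.

```c
#include <libintl.h>
#include <locale.h>
#include <stddef.h>

/* Placeholder values; normally provided by config.h or the Makefile.  */
#ifndef PACKAGE
# define PACKAGE "hello"
#endif
#ifndef LOCALEDIR
# define LOCALEDIR "/usr/share/locale"
#endif

/* Hypothetical helper: perform the usual initialization steps.
   Returns 0 on success, -1 if the text domain could not be set up.  */
int
init_i18n (void)
{
  /* Adopt the locale from the environment.  If this fails, the
     program simply keeps running in the "C" locale.  */
  setlocale (LC_ALL, "");

  /* Tell gettext where the catalogs of this package are installed.  */
  if (bindtextdomain (PACKAGE, LOCALEDIR) == NULL)
    return -1;

  /* Select this package's catalog for subsequent gettext calls.  */
  if (textdomain (PACKAGE) == NULL)
    return -1;

  return 0;
}
```

A program would call init_i18n once, near the start of main, before it produces any translatable output.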
The use of LC_ALL might not be appropriate for you.
LC_ALL includes all locale categories and especially
LC_CTYPE. This latter category determines the character classes
reported by isalnum and the other functions from ctype.h, which
could be wrong, especially for programs that process some
kind of input language. For example, this would mean that
source code using the ç (c-cedilla character) is runnable in
France but not in the U.S.
Some systems also have problems with parsing numbers using the
scanf functions if a locale category other than LC_ALL is
used. The standards say that additional formats besides the one known in the
"C" locale might be recognized. But some systems seem to reject
numbers in the "C" locale format. In some situations, it might
also be a problem with the notation itself which makes it impossible to
recognize whether the number is in the "C" locale or the local
format. This can happen if thousands separator characters are used.
Some locales define this character according to the national
conventions to '.' which is the same character used in the
"C" locale to denote the decimal point.
So it is sometimes necessary to replace the LC_ALL line in the
code above by a sequence of setlocale lines:
{
  ...
  setlocale (LC_CTYPE, "");
  setlocale (LC_MESSAGES, "");
  ...
}
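The effect of setting only selected categories can be checked at run time. In the sketch below, the function name init_locale_partially is hypothetical; it sets LC_CTYPE and LC_MESSAGES from the environment and then queries LC_NUMERIC, which was never changed and therefore still names the "C" locale that every program starts in.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: localize character handling and messages only.
   Returns 1 if LC_NUMERIC is still the "C" locale afterwards.  */
int
init_locale_partially (void)
{
  const char *numeric;

  setlocale (LC_CTYPE, "");
  setlocale (LC_MESSAGES, "");

  /* Passing NULL queries the category without modifying it.  */
  numeric = setlocale (LC_NUMERIC, NULL);
  return numeric != NULL && strcmp (numeric, "C") == 0;
}
```

Number parsing and formatting thus keep using the "C" conventions, while messages are still translated.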
On all POSIX conformant systems the locale categories LC_CTYPE,
LC_MESSAGES, LC_COLLATE, LC_MONETARY,
LC_NUMERIC, and LC_TIME are available. On some systems
which are only ISO C compliant, LC_MESSAGES is missing, but
a substitute for it is defined in GNU gettext's <libintl.h> and
in GNU gnulib's <locale.h>.
Note that changing the LC_CTYPE also affects the functions
declared in the <ctype.h> standard header and some functions
declared in the <string.h> and <stdlib.h> standard headers.
If this is not
desirable in your application (for example in a compiler's parser),
you can use a set of substitute functions which hardwire the C locale,
such as found in the modules ‘c-ctype’, ‘c-strcase’,
‘c-strcasestr’, ‘c-strtod’, ‘c-strtold’ in the GNU gnulib
source distribution.
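The idea behind such substitute functions can be sketched in a few lines. The function below is not the gnulib implementation; ascii_isalnum is a hypothetical name, and the sketch assumes an ASCII-compatible execution character set.

```c
#include <stdbool.h>

/* Hypothetical locale-independent counterpart of isalnum: the result
   depends only on the character code, never on LC_CTYPE.
   Assumes an ASCII-compatible execution character set.  */
bool
ascii_isalnum (int c)
{
  return (c >= '0' && c <= '9')
         || (c >= 'A' && c <= 'Z')
         || (c >= 'a' && c <= 'z');
}
```

A parser built on such a function classifies the byte 0xE7 (ç in ISO-8859-1) the same way in every locale, namely as not alphanumeric.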
It is also possible to switch the locale back and forth between the
environment dependent locale and the C locale, but this approach is
normally avoided because a setlocale call is expensive,
because it is tedious to determine the places where a locale switch
is needed in a large program's source, and because switching a locale
is not multithread-safe.
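For completeness, here is a sketch of what such a temporary switch looks like, with the drawbacks just mentioned still applying. The function format_double_in_c_locale is a hypothetical helper; it copies the current LC_NUMERIC name because the string returned by setlocale may be overwritten by a later setlocale call.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: format VALUE with a '.' decimal point by
   switching LC_NUMERIC to "C" for the duration of the call.
   Not multithread-safe.  Returns the snprintf result, or -1.  */
int
format_double_in_c_locale (char *buf, size_t bufsize, double value)
{
  const char *current = setlocale (LC_NUMERIC, NULL);
  char *saved;
  int result;

  if (current == NULL)
    return -1;

  /* Keep a private copy of the current locale name.  */
  saved = malloc (strlen (current) + 1);
  if (saved == NULL)
    return -1;
  strcpy (saved, current);

  setlocale (LC_NUMERIC, "C");
  result = snprintf (buf, bufsize, "%.2f", value);

  /* Restore the previous locale and release the copy.  */
  setlocale (LC_NUMERIC, saved);
  free (saved);
  return result;
}
```

Each call costs two setlocale invocations, which illustrates why the hardwired substitute functions are usually preferred.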
Before strings can be marked for translations, they sometimes need to be adjusted. Usually preparing a string for translation is done right before marking it, during the marking phase which is described in the next sections. What you have to keep in mind while doing that is the following.
Let's look at some examples of these guidelines.
Translatable strings should be in good English style. If slang language with abbreviations and shortcuts is used, often translators will not understand the message and will produce very inappropriate translations.
"%s: is parameter\n"
This is nearly untranslatable: Is the displayed item a parameter or the parameter?
"No match"
The ambiguity in this message makes it unintelligible: Is the program attempting to set something on fire? Does it mean "The given object does not match the template"? Does it mean "The template does not fit for any of the objects"?
In both cases, adding more words to the message will help both the translator and the English speaking user.
Translatable strings should be entire sentences. It is often not possible t