There are many different ways to encode a particular string into a
PO file entry, because there are so many different ways to split and
quote multi-line strings, and even to represent special characters
by backslash escape sequences. Some features of PO mode rely on
the ability for PO mode to scan an already existing PO file for a
particular string encoded into the
msgid field of some entry.
Even if PO mode has internally all the built-in machinery for
implementing this recognition easily, doing it fast is technically
difficult. To facilitate a solution to this efficiency problem,
we decided on a canonical representation for strings.
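To illustrate why lookup is hard without a canonical form, here is a minimal Python sketch showing that differently split and quoted PO fields can denote the same string. The helper name `decode_po_string` is hypothetical and is not part of PO mode or GNU gettext, and only a few escapes are handled.

```python
import re

# Escapes handled by this sketch only; real PO files accept more C escapes.
_ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}

def decode_po_string(field_text):
    """Join the quoted pieces of a PO field and undo backslash escapes."""
    # Each piece is a double-quoted run that may contain backslash escapes.
    pieces = re.findall(r'"((?:[^"\\]|\\.)*)"', field_text)
    raw = "".join(pieces)
    return re.sub(r"\\(.)", lambda m: _ESCAPES.get(m.group(1), m.group(1)), raw)

# Two encodings of the same string: one quoted piece, or several pieces.
one_line = r'"Hello,\nworld!\n"'
split = '""\n"Hello,\\n"\n"world!\\n"'
print(decode_po_string(one_line) == decode_po_string(split))
```

Comparing decoded values like this is always possible but requires decoding every entry; a canonical representation lets a search compare the encoded text directly.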
A conventional representation of strings in a PO file is currently
under discussion, and PO mode experiments with a canonical representation.
Having both xgettext and PO mode converge towards a uniform
way of representing equivalent strings would be useful, as the internal
normalization needed by PO mode could be automatically satisfied
when using xgettext from GNU gettext. An explicit
PO mode normalization should then be only necessary for PO files
imported from elsewhere, or for when the convention itself evolves.
So, to normalize at least those strings of a given PO file that need a canonical representation, the following PO mode command is available:
M-x po-normalize
Tidy the whole PO file by making entries more uniform.
The special command M-x po-normalize, which has no associated
keys, revises all entries, ensuring that strings of both original
and translated entries use uniform internal quoting in the PO file.
It also removes any crumb after the last entry. This command may be
useful for PO files freshly imported from elsewhere, or if we ever
improve on the canonical quoting format we use. This canonical format
is not only meant for getting cleaner PO files, but also for greatly
speeding up msgid string lookup for some other PO mode commands.
M-x po-normalize presently makes three passes over the entries.
The first implements heuristics for converting PO files for GNU
gettext 0.6 and earlier, in which msgid and msgstr
fields were using K&R style C string syntax for multi-line strings.
These heuristics may fail for comments not related to obsolete
entries and ending with a backslash; they also depend on subsequent
passes for finalizing the proper commenting of continued lines for
obsolete entries. This first pass might disappear once all older PO
files have been adjusted. The second and third passes normalize
all msgid and msgstr strings respectively. They also
clean out those trailing backslashes used by XView’s msgfmt
for continued lines.
Having such an explicit normalizing command allows for importing PO
files from other sources, but also eases the evolution of the current
convention, evolution driven mostly by aesthetic concerns, as of now.
It is easy to make suggested adjustments at a later time, as the
normalizing command and, eventually, other GNU gettext tools
should greatly automate conformance. A description of the canonical
string format is given below, for the particular benefit of those not
having Emacs handy, and who would nevertheless want to handcraft
their PO files in nice ways.
Right now, in PO mode, strings are single line or multi-line. A string goes multi-line if and only if it has embedded newlines, that is, if it matches ‘[^\n]\n+[^\n]’. So, we would have:
msgstr "\n\nHello, world!\n\n\n"
but, replacing the space by a newline, this becomes:
msgstr ""
"\n"
"\n"
"Hello,\n"
"world!\n"
"\n"
"\n"
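The current rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `escape_po` and `format_po_field` are hypothetical, PO mode itself is written in Emacs Lisp, and the escaping shown covers only backslash, double quote, newline and tab.

```python
import re

def escape_po(text):
    """Escape a few special characters (an assumed, incomplete set)."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\t", "\\t"))

def format_po_field(keyword, value):
    """Write a field single-line, or multi-line if it has embedded newlines."""
    if not re.search(r"[^\n]\n+[^\n]", value):
        return '%s "%s"' % (keyword, escape_po(value))
    # Multi-line: an empty string first, then one piece per newline.
    pieces = re.findall(r"[^\n]*\n|[^\n]+", value)
    lines = ['%s ""' % keyword]
    lines += ['"%s"' % escape_po(piece) for piece in pieces]
    return "\n".join(lines)

print(format_po_field("msgstr", "\n\nHello, world!\n\n\n"))
print(format_po_field("msgstr", "\n\nHello,\nworld!\n\n\n"))
```

Leading and trailing newlines alone do not trigger the multi-line form, because the pattern requires a non-newline character on both sides of the run of newlines.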
We are deliberately using an exaggerated example here, to make the point clearer. Usually, multi-line strings are not that bad looking. It is probable that we will implement the following suggestion. We might lump together all initial newlines into the initial empty string, and also all newlines introducing empty lines (that is, for a run of n > 1 newlines, the last n-1 newlines would go together on a separate string), so making the previous example appear:
msgstr "\n\n"
"Hello,\n"
"world!\n"
"\n\n"
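One possible reading of this suggestion, sketched in Python. The names `escape_po` and `format_po_field_lumped` are hypothetical, and the handling of strings that do not start with a newline (keeping the empty first string) is an assumption, since the text does not settle it.

```python
import re

def escape_po(text):
    """Escape a few special characters (an assumed, incomplete set)."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\t", "\\t"))

def format_po_field_lumped(keyword, value):
    """Like the current rule, but with runs of bare newlines lumped together."""
    if not re.search(r"[^\n]\n+[^\n]", value):
        return '%s "%s"' % (keyword, escape_po(value))
    # A piece is either text with its own newline, or a whole run of newlines.
    pieces = re.findall(r"[^\n]+\n?|\n+", value)
    if pieces[0].startswith("\n"):
        first = pieces.pop(0)   # initial newlines replace the empty string
    else:
        first = ""              # assumption: otherwise keep the empty string
    lines = ['%s "%s"' % (keyword, escape_po(first))]
    lines += ['"%s"' % escape_po(piece) for piece in pieces]
    return "\n".join(lines)

print(format_po_field_lumped("msgstr", "\n\nHello,\nworld!\n\n\n"))
```

Because each text line keeps its own newline, a run of n newlines after text leaves n-1 of them for the separate string, as in the example above.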
A few minor points about string normalization are still undecided; they will be documented in this manual once these questions settle.