Localization for developers
Localization, often abbreviated as l10n, is from the developer's perspective a multi-step process that involves a variety of tools. Most of these tools, as well as the central data format (SDF), are specific to OpenOffice and not used anywhere else. Only part of the workflow is integrated into the build system; much of it requires manual steps. Some of the tools involved are not part of the OpenOffice SVN and, due to a hard disk crash of the old Pootle server, are lost.
The actual translation is done with the help of a Pootle server. Most of the localization workflow is about uploading data to and downloading data from the Pootle server, and about extracting strings from and merging them back into source files.
If you are looking for information about how to contribute translations, then this page gives an (outdated) overview.
The localization process consists of several steps. The short version looks like this:
- Content creation: write code, help files, or any other content that needs localization.
- Once in a while (for every milestone) the localize_sl script is used to extract the strings that need to be localized.
- The .sdf file created by localize_sl is uploaded to the Pootle server and converted into .po files.
- Translation takes place, either directly via the Pootle server's HTML frontend or via an offline editor.
- The .po files are eventually downloaded from the Pootle server, converted into .sdf files, and checked for errors.
- When the office is built with the configure switch --with-lang="...", the English strings are replaced by translated strings from the localize.sdf files. The result is a localized install set ready to use, or a language pack that can be applied to an already installed office.
Write text that needs to be localized. This can be help files, configuration files (.xcu), or resource files (.src). Source code does not contain localizable strings directly but uses resource files for that.
Once in a while (for every milestone), run solenv/bin/localize_sl (which forwards the call to solver/340/<platform>/bin/localize_sl<.exe>, which in turn forwards it to solver/340/<platform>/bin/localize<.exe>).
localize iterates over all files in the source tree and searches for files that may contain strings that need localization. The files found are processed with one of several extractors (implemented in a variety of languages: C++, Python, Java). The result is a single .sdf file.
A typical call to localize looks like this:
localize -e -l en-US -f foo.sdf
The resulting foo.sdf.main (where does the .main suffix come from?) has at the moment (SVN revision 1237934) 72,556 lines and 13,063,597 bytes. 45,302 of these lines (9,026,966 bytes) belong to the helpcontent2 module.
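Each line of an .sdf file describes one localizable string as a row of tab-separated fields, including resource identifiers, the language, and the text itself. As a hedged illustration only (the field names and their order below are assumptions, not the official layout), parsing a line might look like this:

```python
# Sketch: parse one line of an .sdf file. SDF is tab-separated, one
# localizable string per line. The field names below are assumptions
# for illustration; the real format has a fixed set of columns.
FIELDS = [
    "project", "source_file", "dummy", "resource_type", "gid", "lid",
    "helpid", "platform", "width", "language", "text", "helptext",
    "quickhelptext", "title", "timestamp",
]

def parse_sdf_line(line):
    """Split a tab-separated SDF line into a field dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError("unexpected number of fields: %d" % len(values))
    return dict(zip(FIELDS, values))

# A made-up sample line with 15 tab-separated fields.
sample = "\t".join([
    "sw", "source\\ui\\app\\app.src", "0", "string", "STR_DOC", "", "",
    "", "0", "en-US", "Document", "", "", "", "2012-01-01 00:00:00",
])
entry = parse_sdf_line(sample)
```

The helpcontent2 share of the file can then be measured by counting the lines whose first field is the helpcontent2 module.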
At the moment localize runs with errors on Windows: jpropex, a shell script that calls a Java program, does not run. On Linux it works.
Note: On Linux or MacOS you have to use a fully qualified path to the output file. Otherwise you get neither an output file nor an error message. The tooling seems to be very error-prone; there is a lot of room for improvement.
The .sdf file created by localize is uploaded to the Pootle server and converted into .po files (not necessarily in that order), probably by merging into existing .po files.
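Conceptually, the conversion emits one .po entry per untranslated string, recording where the string came from. The real conversion (done by tools such as oo2po from the Translate Toolkit) handles escaping, existing translations, comments, and much more; the following is only a minimal sketch, and the location format is an assumption:

```python
def sdf_to_po_entry(entry):
    """Render one SDF entry (a field dict) as a minimal .po entry.

    Assumes the dict has 'source_file', 'gid', 'lid', and 'text' keys.
    Escaping, plural forms, and merging into existing .po files are
    deliberately omitted.
    """
    location = "%s#%s.%s" % (entry["source_file"], entry["gid"], entry["lid"])
    return '#: %s\nmsgid "%s"\nmsgstr ""\n' % (location, entry["text"])

entry = {"source_file": "sw/source/ui/app/app.src",
         "gid": "STR_DOC", "lid": "text", "text": "Document"}
po_entry = sdf_to_po_entry(entry)
```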
The helpcontent2 module is handled separately from the other modules to avoid disheartening translators who work on the UI part (everything except helpcontent2) and would otherwise see little progress (due to the large number of strings in helpcontent2).
Translation takes place, either directly via the Pootle server's HTML frontend or via an offline editor.
The .po files are eventually downloaded from the Pootle server, converted into .sdf files (or converted and then downloaded), and checked for various errors with the gsicheck tool. Then they are integrated into the localize.sdf files in extras/l10n/source/<language>/.
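gsicheck performs a whole battery of checks; as a hedged illustration of the idea, one basic check is that a translation preserves the inline markup of the source string:

```python
import re

def tags(text):
    """Return the XML-like inline tags contained in a string."""
    return sorted(re.findall(r"</?\w+/?>", text))

def check_translation(source, translation):
    """Tiny consistency check in the spirit of gsicheck: the
    translation must contain the same inline tags as the source.
    (The real tool checks far more: field counts, escapes, lengths, ...)
    """
    return tags(source) == tags(translation)

ok = check_translation("Press <emph>OK</emph>.", "<emph>OK</emph> drücken.")
bad = check_translation("Press <emph>OK</emph>.", "OK drücken.")
```

A translation that drops or mangles a tag would break the rendered help or GUI text, which is why such checks run before the strings are integrated.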
When the office is built with the configure switch --with-lang="...", extras/l10n is built and the localize.sdf files are rearranged: in extras/l10n they are grouped by language, after the rearrangement they are grouped by module (and directory). The .sdf files in extras/l10n/<platform>/misc/sdf are zipped into one archive per module, delivered into main/solver/340/<platform>/sdf/<module>.zip, and then forgotten (at least for the processing of src files).
Resource files (.src files) are processed when the other modules are built. The original .src files contain strings only for en-US, in lines that look like
Text [en-US] = "...";
transex3 adds the missing languages by adding lines like
Text [de] = "...";
By default all (available) languages are added, not just the ones given to configure's --with-lang switch. The augmented .src files are placed in <module>/<platform>/misc/... These are then aggregated into srs files in <module>/<platform>/srs/. In one or more subsequent steps the srs files are aggregated into res files, one for each language.
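The effect of the merge step can be sketched as follows. This is a simplification with illustrative names only; real .src syntax and transex3's matching (by resource id, not by string) are more involved:

```python
import re

def merge_translations(src_lines, translations):
    """For every 'Text [en-US] = "...";' line, append one line per
    translated language. `translations` maps the en-US string to a
    dict of language -> translated string. A rough sketch of what
    transex3 does, ignoring escaping and other localizable keywords.
    """
    out = []
    for line in src_lines:
        out.append(line)
        m = re.match(r'(\s*)Text \[en-US\] = "(.*)";', line)
        if m:
            indent, text = m.groups()
            for lang, translated in sorted(translations.get(text, {}).items()):
                out.append('%sText [%s] = "%s";' % (indent, lang, translated))
    return out

merged = merge_translations(
    ['String STR_DOC', '{', '    Text [en-US] = "Document";', '};'],
    {"Document": {"de": "Dokument"}})
```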
The resulting res files are delivered to main/solver and become part of the installation sets. Multi-language versions contain res files for more than one language.
At runtime the ResMgr class from the tools module is responsible for loading strings from the resource files of the currently selected language whenever a string is requested (as is the case for, e.g., all button texts and in general for all text visible in the GUI).
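The lookup idea can be illustrated as follows. This is not ResMgr's actual C++ API; it is a toy model of per-language resource tables with a fallback to the untranslated en-US string:

```python
class ResourceManager:
    """Toy model of per-language string resources with en-US fallback.
    Purely illustrative; not the real ResMgr interface."""

    def __init__(self, resources, language):
        self.resources = resources   # language -> {resource id -> string}
        self.language = language     # currently selected UI language

    def get_string(self, res_id):
        table = self.resources.get(self.language, {})
        if res_id in table:
            return table[res_id]
        # Fall back to the untranslated en-US string.
        return self.resources["en-US"][res_id]

mgr = ResourceManager(
    {"en-US": {"STR_CANCEL": "Cancel"}, "de": {"STR_CANCEL": "Abbrechen"}},
    "de")
```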
Quite a number of different file formats are involved in the localization process. The following list is not complete and may be inaccurate:
- .src: Source files of resources. Most strings used in the GUI are defined in .src files.
- .hrc: Header files of .src resource files.
- Made by rsc (which calls rscpp and rsc2) from multiple src files with *all* language strings included.
- Created by transex3 from .srs files.
- .sdf: Used to store localized/localizable strings and their origins. Comparable to .po files.
- .pot: Created by gettext from source files. Contains strings that need translation. Not used by OpenOffice.
- .po: Contains the translated strings from a .pot file. Used on the Pootle server.
- .xhp: Help files of OpenOffice. Another source of strings that need translation.
- .xliff: A format with the same use as .po, but with more functionality, and standardized.
A large number of tools, implemented in a variety of languages (C++, Java, Perl, Python, sh), are involved in the localization process. They mostly extract strings from source files and merge the translated strings back in, or transform between different data formats.
The following list is not (yet) complete and may (still) be inaccurate:
- xbtxex: called from localize to extract strings from .tree or .xtx files.
- ulfex: called from localize to extract strings from .ulf files.
- xmlex: called from localize to extract strings from .xrb, .xxl, and .xgf files.
- cfgex: called from localize to extract strings from .xcd, .xcu, and .xcs files.
- xrmex: called from localize to extract strings from .xrm files.
- helpex: called from localize to extract strings from .xhp files.
- jpropex: called from localize to extract strings from .properties files.
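The dispatch from file type to extractor can be summarized as a table. The names xbtxex, xmlex, and jpropex appear elsewhere on this page; the pairing of each extractor with its extensions below is a reconstruction for illustration, and the real implementation hard-codes this dispatch in C++:

```python
import os

# Assumed mapping of file extensions to the extractor that localize
# invokes, reconstructed from the tool list above.
EXTRACTORS = {
    ".tree": "xbtxex", ".xtx": "xbtxex",
    ".ulf": "ulfex",
    ".xrb": "xmlex", ".xxl": "xmlex", ".xgf": "xmlex",
    ".xcd": "cfgex", ".xcu": "cfgex", ".xcs": "cfgex",
    ".xrm": "xrmex",
    ".xhp": "helpex",
    ".properties": "jpropex",
}

def extractor_for(path):
    """Return the extractor responsible for a source file, or None
    if the file type carries no localizable strings."""
    return EXTRACTORS.get(os.path.splitext(path)[1])
```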
The current localization workflow as outlined above has several drawbacks.
- The workflow looks more like an ad-hoc solution than a designed approach.
- The tools involved are written in a variety of languages: C++, Java, Perl, and Python. This is not bad in itself. For example, it makes sense to parse Java property files with Java code. But there is also C++ code for iterating over the tree of source files that uses hard-coded lists of other executables and scripts for processing individual files. That leads to many processes being created and destroyed, something that is notoriously slow on Windows.
- Some of the tools are not used anymore. For example, I did not find any .xtx, .xrb, .xxl, .xgf, or .xcd files; therefore the xbtxex and xmlex tools can be dropped (this may have already happened for xmlex). Others are used but do not run (like the jpropex tool). And then there is our own preprocessor for handling resource files, which might be replaceable by the standard C/C++ preprocessor (which parses the included hrc files anyway, since they are included in C++ code).
- OpenOffice uses its own non-standard file format (SDF) for handling localized strings. In order to use a Pootle server for the actual translation, all .sdf files have to be transformed into .po files and, after translation, back into .sdf files. A future migration to the XLIFF format for the translation handoff should also be taken into consideration.
- The localization workflow is convoluted and hard to understand. Much tooling is involved outside the build process. This results in a manual process that is undocumented and known only to a select few. Some of this tooling seems to have been lost in a disk crash of the old OpenOffice Pootle server: it was not even contained in the source code repository.
Here is a list of things for improving the localization workflow:
- Understand the current workflow better by analyzing and documenting it.
- Get rid of the .sdf files and use the .po or .xliff files directly. Those file formats do not seem to be much more complicated. The transformation from .sdf to .po and back again would no longer be necessary, and widely used tools (used and developed outside the OpenOffice project) could be employed.
- Streamline the number and implementation of the tools used for extraction and merging of localizable strings. Use the right language for each task.
- Integrate the string extraction into the build process. Most of the files that can contain localizable strings are already part of the build system, mostly for the merge process. For example, there are make rules for transforming and merging .src files into .srs and then into .res files. Add rules for the string extraction. This would allow developers to count new strings, and the buildbot could extract the new strings and upload them to the Pootle server.
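With extraction in the build, counting new strings could be a simple diff of the extracted keys between two milestones. A sketch, assuming the tab-separated SDF column positions used below (they are illustrative, not the documented layout):

```python
def sdf_keys(lines):
    """Build the set of resource keys from tab-separated SDF lines.
    The field positions (source file, gid, lid, language) are
    assumptions for illustration."""
    keys = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        keys.add((fields[1], fields[4], fields[5], fields[9]))
    return keys

def count_new_strings(old_lines, new_lines):
    """Number of keys present in the new extraction but not the old."""
    return len(sdf_keys(new_lines) - sdf_keys(old_lines))

# Two tiny, made-up extractions: the second one adds a single string.
old = ["sw\tapp.src\t0\tstring\tSTR_A\t\t\t\t0\ten-US\tA"]
new = old + ["sw\tapp.src\t0\tstring\tSTR_B\t\t\t\t0\ten-US\tB"]
added = count_new_strings(old, new)
```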