Mapping Unix Locale Identifiers to BCP-47
Format of Unix Locale Identifier
If the locale value has the form: language[_territory][.codeset] it refers to an implementation-provided locale, where settings of language, territory and codeset are implementation-dependent. [Some categories can be] defined to accept an additional field "@modifier", which allows the user to select a specific instance of localisation data within a single category (for example, for selecting the dictionary as opposed to the character ordering of data). The syntax for these environment variables is thus defined as: [language[_territory][.codeset][@modifier]]
The functions recognize the format of the value of the environment variable. They can split the value into different pieces, and by leaving out one or the other part they can construct new values. This happens of course in a predictable way. To understand this one must know the format of the environment variable value. There is one more or less standardized form, originally from the X/Open specification: language[_territory[.codeset]][@modifier]
Clearly the territory, codeset and modifier sections are all equally optional. For example, in the string tt_RU@iqtelif, tt is the language, RU is the territory, the codeset is empty and the modifier is iqtelif. Equally, for tt_RU@iqtelif.foo, tt is the language, RU is the territory, the codeset is empty and the modifier is "iqtelif.foo". Parsing this by just searching for the first "_", then a subsequent "." and then a subsequent "@" would give tt for the language, "RU@iqtelif" for the territory and foo for the encoding, which is bogus.
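A minimal sketch of a splitter that avoids the problem described above, assuming the language[_territory[.codeset]][@modifier] form. The names UnixLocale and parse_unix_locale are hypothetical and do not come from glibc, glib or gettext. The key point is that the modifier is cut off at the first "@" before looking for "." or "_", so a "." inside the modifier is never mistaken for a codeset separator.

```python
from typing import NamedTuple


class UnixLocale(NamedTuple):
    """The four parts of a Unix locale identifier; absent parts are ''."""
    language: str
    territory: str
    codeset: str
    modifier: str


def parse_unix_locale(value: str) -> UnixLocale:
    # Everything after the first "@" is the modifier, including any ".".
    head, _, modifier = value.partition("@")
    # Only the part before "@" may carry a codeset.
    head, _, codeset = head.partition(".")
    # What remains is language, optionally followed by "_territory".
    language, _, territory = head.partition("_")
    return UnixLocale(language, territory, codeset, modifier)
```

With this ordering, parse_unix_locale("tt_RU@iqtelif.foo") yields an empty codeset and the modifier "iqtelif.foo", matching the breakdown given in the text.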
For the most part the simple case, when excluding the encoding, is that the Unix identifier is language_territory and that language-territory forms a valid sequence of BCP 47 subtags. The other cases to consider fall into two main categories: the use of collective and/or obsolete languages, and the use of @modifiers.
use of collective and/or obsolete languages
- glibc continues to support locales identified by long names, e.g. deutsch, japanese.sjis.
- glibc also continues to support the obsolete language codes of no and iw
- glibc has some locales like ber_DZ and ber_MA. ber is now classified as a collective language, so it's unclear from the identifier itself which specific language is truly indicated
Modifiers are effectively free-form, but the existing modifiers in glibc break down into:
- @modifiers that indicate a particular currency, e.g. en_IE@euro
- @modifiers that indicate a non-default script, e.g. uz_UZ@cyrillic, be_BY@latin
- @modifiers that indicate a dialect or variant of the language, e.g. aa_ER@saaho, ca_ES@valencia
- @modifiers that indicate a non-default collation rule, e.g. gez_ER@abegede
- @modifiers that indicate that East Asian ambiguous width characters should default to being considered narrow, e.g. zh_CN@cjknarrow
- Substitute any identifiers appearing in locale.alias according to those aliases
- Parse to language, territory, encoding, modifier
- Substitute language of iw to he
- Substitute language of no to nb
- Ignore @euro modifier, it's redundant now
- Ignore @cjknarrow modifier, it's orthogonal information
- Convert the following modifiers to script subtags:
  - "cyrillic" to script-tag of "Cyrl"
  - "latin" to script-tag of "Latn"
  - "devanagari" to script-tag of "Deva"
  - "iqtelif" to script-tag of "Latn" (?)
- aa_ER@saaho claims "Afar language locale for Eritrea (Saaho Dialect)", but ssy denotes "Saho, A language of Eritrea. Very similar to Afar". Convert aa to ssy when the @modifier is saaho, i.e. ssy-ER
- ca_ES@valencia: valencia is a registered BCP 47 variant, so use ca-ES-valencia
- gez_ER@abegede claims "Abegede Collation for Ge'ez". There seems to be no existing tag to indicate this anywhere, so a private-use subtag of x-abegede is suggested for the interim, i.e. gez-ER-x-abegede
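The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the function name unix_locale_to_bcp47 and the table names are hypothetical, the aliases argument stands in for the parsed contents of locale.alias (reading that file is not shown), the "iqtelif" to "Latn" mapping is the one marked with "(?)" above, and modifiers not mentioned in the list are silently dropped. The special names "C" and "POSIX" and the ber_ locales are not handled.

```python
from typing import Mapping, Optional

# Tables taken from the list above; the names are illustrative only.
OBSOLETE_LANGUAGES = {"iw": "he", "no": "nb"}
IGNORED_MODIFIERS = {"euro", "cjknarrow"}
SCRIPT_MODIFIERS = {
    "cyrillic": "Cyrl",
    "latin": "Latn",
    "devanagari": "Deva",
    "iqtelif": "Latn",  # uncertain, marked "(?)" in the list above
}


def unix_locale_to_bcp47(value: str,
                         aliases: Optional[Mapping[str, str]] = None) -> str:
    # Step 1: substitute aliases (assumed to be pre-parsed from locale.alias).
    if aliases:
        value = aliases.get(value, value)

    # Step 2: parse, taking the modifier first so it may contain ".".
    head, _, modifier = value.partition("@")
    head, _, _codeset = head.partition(".")  # the encoding is discarded
    language, _, territory = head.partition("_")

    # Step 3: replace obsolete language codes.
    language = OBSOLETE_LANGUAGES.get(language, language)

    # Step 4: interpret the modifier.
    script = variant = private = ""
    if modifier in IGNORED_MODIFIERS:
        pass  # redundant (@euro) or orthogonal (@cjknarrow) information
    elif modifier in SCRIPT_MODIFIERS:
        script = SCRIPT_MODIFIERS[modifier]
    elif language == "aa" and modifier == "saaho":
        language = "ssy"
    elif language == "ca" and modifier == "valencia":
        variant = "valencia"
    elif language == "gez" and modifier == "abegede":
        private = "x-abegede"
    # Any other modifier is dropped in this sketch.

    # BCP 47 order: language, script, region, variant, private use.
    subtags = [language, script, territory, variant, private]
    return "-".join(s for s in subtags if s)
```

For instance, unix_locale_to_bcp47("uz_UZ@cyrillic") places the script subtag between language and region, and unix_locale_to_bcp47("gez_ER@abegede") appends the private-use subtag last.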
Debatable issues surround the ber_ family
- The ber_DZ locale claims it's "Amazigh language locale for Algeria (latin)". Algeria has apparently standardized on Kabyle, written in Latin script, so ber could perhaps be converted to kab when the territory is DZ, i.e. kab-DZ.
- The ber_MA locale claims it's "Amazigh language locale for Morocco (tifinagh)". It's a little unclear what exactly is specified here in the absence of a "Standard Amazigh"/"Standard Tamazight". (It's of no help to e.g. examine the translations in the glibc locale description file to see what language they were written in, because they are all just copied from the Azerbaijani locale file and so aren't in any Berber language!) The languages being taught through Tifinagh in Morocco seem to be rif, tzm and shi, where tzm and shi each have approximately 3 million speakers to rif's 1.5 million. In any case there seems to be, or to have been, plenty of controversy about the script itself, so adding the Tfng script tag to give the tag some distinguishing information seems called for, especially as there's no suppress-script field in the language-subtag-registry entries for those languages.
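One possible reading of the two ber_ suggestions above, as a small sketch. Both mappings are debatable, as noted, and the function name map_berber_locale is hypothetical. For ber_MA the sketch keeps the collective code ber and only adds the Tfng script subtag, since the text does not settle on rif, tzm or shi.

```python
def map_berber_locale(territory: str) -> str:
    """Map a glibc ber_<territory> locale to a tentative BCP 47 tag."""
    if territory == "DZ":
        # Assumption from the text: Algeria uses Kabyle in Latin script.
        return "kab-DZ"
    if territory == "MA":
        # Language left as the collective "ber"; only the script is added.
        return "ber-Tfng-MA"
    # No suggestion is made for other territories; pass them through.
    return "ber-" + territory if territory else "ber"
```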
- POSIX:2008 http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html
- gettext manual
- rhbz#589138 Oddly named tt_RU@iqtelif.UTF-8/tt_RU.utf8@iqtelif.UTF-8 locales.
- gnome#618108 Fix glib locale splitter
- xdg#19881 Berber orthographies in Latin and Tifinagh
- xdg#19869 fontconfig should change to BCP 47 language tags