Friday, May 16, 2008

Languages and cultures

It turns out that there is a lot of inconsistency between the projects when it comes to the naming conventions of languages and cultures. So I thought that maybe someone could help me. Right now I am making the following assumptions:
  1. For any language code ab: ab_AB is the same as ab
  2. fy_NL is the same as fy
  3. ga_IE is the same as ga
  4. hy_AM is the same as hy
  5. nb_NO is the same as nb
  6. nn_NO is the same as nn
  7. nds_DE is the same as nds
  8. sv_SE is the same as sv
  9. ur_PK is the same as ur
However, I was not able to determine whether the following language codes are really the same:
  • bn_in and bn
  • gu_in and gu
  • pa_in and pa
  • no and (nb or nn)
  • nds_NFE and nds
I suppose they are the same, but I wouldn't like to offend anybody, so I put those on hold. If my reasoning is wrong, please let me know so that I can fix it.

10 comments:

Anonymous said...

> 1. For any language code ab: ab_AB is
> the same as ab

Totally wrong. The first component is ISO 639 language code, the second component is ISO 3166 country code. (It's all documented in gettext manual, one info gettext away.)

Anonymous said...

As anonymous wrote, the first is the language and the second is the country. There are cases where the country actually changes the language (e.g. pt (pt_PT) vs. pt_BR), others where the difference is minor (en_US vs. en_GB), and others where there is really no difference except for date/numeric/currency etc. localization (es_PR vs. es_US).

For your specific examples, mostly the countries are the "default" countries and thus adding them doesn't change anything (e.g. fy [Frisian] is only used in NL [Netherlands], so fy and fy_NL are the same). For cases of xx_XX, just do a search on ISO-639 and ISO-3166 codes and it should be clear if they are default, e.g. hy=Armenian, AM=Armenia.
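To make the language/country split concrete, here is a minimal Python sketch. The helper name split_locale is hypothetical, and the pattern is an assumption that only covers the plain language[_COUNTRY] form used in this thread (it also accepts a three-letter second part so that names like nds_NFE can be inspected rather than rejected).

```python
import re

# Hypothetical helper, not part of gettext: splits a locale name such as
# "pt_BR" into its ISO 639 language part and its optional second part.
_LOCALE_RE = re.compile(r"^([A-Za-z]{2,3})(?:[_-]([A-Za-z]{2,3}))?$")


def split_locale(name):
    """Return (language, country_or_None) for names like 'pt_BR' or 'fy'."""
    match = _LOCALE_RE.match(name)
    if match is None:
        raise ValueError(f"not a language[_COUNTRY] name: {name!r}")
    language, country = match.groups()
    # Normalize case: language codes are lower case, country codes upper case.
    return language.lower(), (country.upper() if country else None)
```

With this, "bn_in" and "bn_IN" both come out as the pair ("bn", "IN"), which makes it easier to compare names that different projects spell differently.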

Some cases where this would not be true:

bn_IN (Bengali, India) vs. bn (Bengali, Bangladesh) - probably minor language, currency differences

similarly for pa_IN (Panjabi, India) vs. pa (Panjabi, Pakistan)

however, most speakers of gu (Gujarati) are in India, so gu=gu_IN

no is Norwegian (unspecified), while nn is Nynorsk and nb is Norwegian Bokmål, which are different dialects. Generally, no implies nb as it is the more common, but strictly this is not true

nds_DE=nds - I don't know what country code is NFE (they should all be two letters) so I'm thinking nds_NFE is a typo?

@alex

Unknown said...

Thanks for the comments! They are very helpful.

I am throwing away country codes in order to shorten the locale names. So pt_PT becomes pt in open-tran.eu and pt_BR becomes pt-br (although pt_BR.open-tran.eu works as well). The reason for this is that some teams simplify the names and some do not, and I would like to provide as comprehensive a set of translations as possible.
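A minimal Python sketch of that shortening rule, for illustration only. The function name short_name and the DEFAULT_COUNTRY table are assumptions; the table simply collects the "default" countries mentioned in this thread and is not the actual open-tran.eu data.

```python
# Assumed table of "default" countries, taken from the codes discussed here.
DEFAULT_COUNTRY = {
    "fy": "NL", "ga": "IE", "hy": "AM", "nb": "NO", "nn": "NO",
    "nds": "DE", "sv": "SE", "ur": "PK", "pt": "PT",
}


def short_name(locale_name):
    """Drop a redundant country code, otherwise return 'lang-country'."""
    language, _, country = locale_name.replace("-", "_").partition("_")
    language = language.lower()
    country = country.upper()
    if not country:
        return language
    # ab_AB (e.g. de_DE) and known default countries collapse to the language.
    if country == language.upper() or DEFAULT_COUNTRY.get(language) == country:
        return language
    return f"{language}-{country.lower()}"
```

Under these assumptions pt_PT and es_ES shorten to pt and es, while pt_BR and bn_IN stay distinct as pt-br and bn-in.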

There are several problems that I am running into. For example, gnome-applets has (among others) translations for both es and es_ES (sic!), which makes me wonder: what is the difference between those two?

I came to the same conclusion with Norwegian, but the number of no phrases is negligible, so I'll just ignore them.

But nds_NFE turned out to be "Northern Low Saxon variates grown on Frisian substrate". I've spent some time in Germany and I know how picky they are when it comes to their dialects and languages. I'd better leave it out as well ;)

Anonymous said...

"ab" is sort of a backup for "ab_*". If a program is run under "pt_PT", "pt_BR" or any other "pt_*" locale and can't find messages for that locale, it will look for messages for "pt".

Brazil and Portugal frequently use different translations for common, basic computer terms. That's a long-standing divergence, so in the near future we will probably continue to have separate translation teams, even if in a few years the orthography will be unified. There are a lot of "pt_*" locales other than Brazil and Portugal. These locales have different currencies, timezones etc., but they don't have their own translation teams. So, when gettext doesn't find translations for that specific locale, it uses the default "pt" translations. That's why translation teams in Portugal save their translations to the "pt" locale, not "pt_PT": to provide translations for other countries, too.
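The fallback described above can be sketched in a few lines of Python. This is a simplified illustration, not gettext's actual implementation: the function names are hypothetical, and the real lookup also deals with encodings and modifiers (such as ".UTF-8" or "@euro"), which are ignored here.

```python
def lookup_chain(locale_name):
    """List the catalog names to try, most specific first."""
    language, _, country = locale_name.partition("_")
    chain = [locale_name] if country else []
    chain.append(language)  # the bare language is the last resort
    return chain


def find_catalog(locale_name, available):
    """Return the first catalog name in the chain that exists, or None."""
    for candidate in lookup_chain(locale_name):
        if candidate in available:
            return candidate
    return None
```

For example, with catalogs available only for "pt" and "pt_BR", a user running under pt_AO would end up with the "pt" catalog, which is why saving the Portugal translations as "pt" serves the other pt_* countries as well.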

On the other hand, Spanish and Arabic have only one translation team per language, at least in GNOME. These languages are spread over a very large part of the world, so of course there are regional differences, but the teams were able to tolerate them. You can safely ignore the residual es_* translations, because they were abandoned a long time ago.

But when I said "Spanish", I meant "Castilian". That's the official dialect in Spain, and the one spoken in the Spanish part of Latin America. There are other dialects in Spain, and they are as different from Castilian as the Portuguese language is. Galician (gl_ES) and Catalan (ca_ES) are examples of how important it is for a language to have borders and armies.

Chinese is also composed of very different dialects, the best known being Mandarin. These dialects have the same orthography (and borders and armies), so when it comes to software translation we can treat them as a single language. There are, however, Hong Kong and Taiwan variations, which use another orthography, the traditional one. In GNOME, Hong Kong Chinese and Taiwan Chinese are two separate languages with the same translation team.

I have no knowledge of Norwegian, though ;)

Unknown said...

You guys are great! Now it all makes sense to me - thanks a lot!

Anonymous said...

eu_ES and eu_FR can be summarized in eu too.

Although there are several dialects in both the northern (FR) and southern (ES) territories, we use a unified "version" called "Batua".

Oh, and don't get confused by Leonardo's message: Galician (gl_ES) and Catalan (ca_ES, ca_AD) are not dialects, they're languages. Basque of course is a language too, but it is quite different from the others.

Unknown said...

Thanks Julen! I will make sure that dialects of Basque are treated properly.

Unknown said...

And gl_ES and gl are the same, because the Galician spoken in Spain is the same as the Galician spoken anywhere else in the world.

Anonymous said...

It is very interesting for me to read that blog. Thanks for it. I like such topics and everything connected to this matter. I definitely want to read more soon.

Anonymous said...

It was extremely interesting for me to read the post. Thanks for it. I like such themes and everything connected to this matter. I definitely want to read more on that blog soon.