The difficulties of Localisation…

The title of this post is a little misleading, but I thought I would leave it in place anyway, as it’s not entirely incorrect.

Cory posted On the maddening subtleties of localizing software last week, and, this being my line of work, I followed the link and read the article by Sean M. Burke and Jordan Lachler. The first thing that struck me was that the article was incredibly code-heavy given that it was about L10n. It turns out it’s more about i18n, but that may just be me splitting hairs. The next thing I noticed was that only the first section is really of passing interest to translators (…who really want to know how hard computer programming can be) – I would posit that the main audience is linguists and those who design computer languages and/or architectures.

And the quote CD chose was a good one for his headline:

So, you email your various translators (the boss decides that the languages du jour are Chinese, Arabic, Russian, and Italian, so you have one translator for each), asking for translations for “I scanned %g directory.” and “I scanned %g directories.”. When they reply, you’ll put that in the lexicons for gettext to use when it localizes your software, so that when the user is running under the “zh” (Chinese) locale, gettext(“I scanned %g directory.”) will return the appropriate Chinese text, with a “%g” in there where printf can then interpolate $dir_scan.

Your Chinese translator emails right back — he says both of these phrases translate to the same thing in Chinese, because, in linguistic jargon, Chinese “doesn’t have number as a grammatical category” — whereas English does. That is, English has grammatical rules that refer to “number”, i.e., whether something is grammatically singular or plural; and one of these rules is the one that forces nouns to take a plural suffix (generally “s”) when in a plural context, as they are when they follow a number other than “one” (including, oddly enough, “zero”). Chinese has no such rules, and so has just the one phrase where English has two. But, no problem, you can have this one Chinese phrase appear as the translation for the two English phrases in the “zh” gettext lexicon for your program.

Emboldened by this, you dive into the second phrase that your software needs to output: “Your query matched 10 files in 4 directories.”. You notice that if you want to treat phrases as indivisible, as the gettext manual wisely advises, you need four cases now, instead of two, to cover the permutations of singular and plural on the two items, $dir_count and $file_count.
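To make that combinatorial blow-up concrete, here is a minimal sketch in Python, using plain string templates in place of a real gettext lexicon (so it runs without any .mo catalogues). The four phrases are illustrative, not taken from the article’s code – the point is simply that two count variables force four indivisible msgids:

```python
# A sketch of the four-permutation problem: one indivisible phrase per
# singular/plural combination, as the gettext manual's "treat phrases as
# indivisible" advice implies. Plain templates stand in for a lexicon.

def scan_report(file_count: int, dir_count: int) -> str:
    templates = {
        (True,  True):  "Your query matched %(f)d file in %(d)d directory.",
        (True,  False): "Your query matched %(f)d file in %(d)d directories.",
        (False, True):  "Your query matched %(f)d files in %(d)d directory.",
        (False, False): "Your query matched %(f)d files in %(d)d directories.",
    }
    # English singular applies only when the count is exactly one
    # (zero takes the plural, as the article notes).
    key = (file_count == 1, dir_count == 1)
    return templates[key] % {"f": file_count, "d": dir_count}

print(scan_report(10, 4))  # → Your query matched 10 files in 4 directories.
print(scan_report(0, 1))
```

Add a third countable item to the phrase and you need eight templates; the cases double with every grammatically numbered slot.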

But the main thrust of the article – beyond the self-stated “A phrase is a function; a phrasebook is a bunch of functions” – is a new way of envisaging i18n, one that provides something that has been missing from the area: competition for, and a challenge to, the ubiquitous gettext:

Consider that sentences in a tourist phrasebook are of two types: ones like “How do I get to the marketplace?” that don’t have any blanks to fill in, and ones like “How much do these ___ cost?”, where there’s one or more blanks to fill in (and these are usually linked to a list of words that you can put in that blank: “fish”, “potatoes”, “tomatoes”, etc.) The ones with no blanks are no problem, but the fill-in-the-blank ones may not be really straightforward. If it’s a Swahili phrasebook, for example, the authors probably didn’t bother to tell you the complicated ways that the verb “cost” changes its inflectional prefix depending on the noun you’re putting in the blank. The trader in the marketplace will still understand what you’re saying if you say “how much do these potatoes cost?” with the wrong inflectional prefix on “cost”. After all, you can’t speak proper Swahili, you’re just a tourist. But while tourists can be stupid, computers are supposed to be smart; the computer should be able to fill in the blank, and still have the results be grammatical.

The reason that using gettext runs into walls (as in the above second-person horror story) is that you’re trying to use a string (or worse, a choice among a bunch of strings) to do what you really need a function for — which is futile. Performing (s)printf interpolation on the strings which you get back from gettext does allow you to do some common things passably well… sometimes… sort of; but, to paraphrase what some people say about csh script programming, “it fools you into thinking you can use it for real things, but you can’t, and you don’t discover this until you’ve already spent too much time trying, and by then it’s too late.”
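The “phrase is a function” idea can be sketched in a few lines of Python. To be clear, the function and phrasebook names here are my own invention rather than Maketext’s API, and the Chinese string is just an illustrative translation – the point is that each locale supplies a callable that owns its own grammar, instead of a string for the caller to interpolate:

```python
# Each locale provides a function; the grammar lives inside it.

def scanned_en(n: int) -> str:
    # English inflects the noun for number, so the function decides.
    noun = "directory" if n == 1 else "directories"
    return f"I scanned {n} {noun}."

def scanned_zh(n: int) -> str:
    # Chinese has no grammatical number: one form covers every count.
    return f"我扫描了 {n} 个目录。"

# A "phrasebook is a bunch of functions": look up the phrase per locale.
phrasebook = {"en": scanned_en, "zh": scanned_zh}

print(phrasebook["en"](1))  # → I scanned 1 directory.
print(phrasebook["en"](4))  # → I scanned 4 directories.
print(phrasebook["zh"](4))
```

The caller never touches the plural logic, so a language with three number categories, or none, needs no changes outside its own function.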

And the solution presented is one called Maketext, which unfortunately seems to have gone the same route as Lojban and Esperanto – interesting as a curio, but never taking off in the way its creators hoped. Having said that, Maketext has some interesting ideas – listed as buzzwords – in particular, inheritance. For example, if en_US were the base for English, en_GB would contain only the changes, or differences, rather than being an almost identical copy. Russian and Ukrainian could share grammatical functions in the same manner, as needed.
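That inheritance idea maps neatly onto ordinary class inheritance. Here is a hypothetical sketch in Python – Maketext actually does this with Perl packages, and these class and method names are made up for illustration:

```python
# Lexicons as classes: a derived locale overrides only what differs.

class EN:
    """Base English lexicon: the full set of phrase functions."""
    def color(self) -> str:
        return "color"
    def scanned(self, n: int) -> str:
        noun = "directory" if n == 1 else "directories"
        return f"I scanned {n} {noun}."

class EN_GB(EN):
    """British English: override the differences, inherit the rest."""
    def color(self) -> str:
        return "colour"

print(EN_GB().color())     # → colour
print(EN_GB().scanned(4))  # inherited unchanged from EN
```

A shared Slavic base class could hold the grammatical machinery common to, say, Russian and Ukrainian, with each locale overriding only its own forms.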

Finally, the last thing I noticed was that this isn’t a new article – it was written in 1998, published in 1999 and edited in 2001, which explains its use (or rather glorification, even worship) of Perl. People just don’t feel that way about Perl any more, I don’t think.

It’s an interesting read if you dig that sort of thing, and it has made me think more about how I would do it differently – since I’ve little interest in learning Perl, and Maketext relies heavily upon previously written Perl modules. I think it also serves translators well: I often feel that they look down their noses at programmers for all the reasons listed in the first section of the article; here they can see that computer scientists think hard about this problem, and that it isn’t an easy one to solve.

Locale::Maketext::TPJ13 — article about software localization