Superlinguo – all things linguistic-y

I’ve been reading the Superlinguo blog recently, as one of those responsible, Georgia, also appears on Triple R’s (my local indie radio station) tech show, Byte Into It.

I’ve been meaning to bring them up for a while – there’s a post about the World Oral Literature Project that I found particularly interesting: they have helped fund a dictionary for Lamjung Yolmo, a Nepalese language or dialect.

The World Oral Literature Project works with communities and linguists all over the world to make accessible records of languages that are dying out at a rapid rate. Their particular interest is in capturing the stories, poems and bits of cultural lore that are often lost when a community no longer speaks its ancestral language. They offer small grants to help people record these stories and tales, but they also run public lectures and workshops in developed countries to show people who might never have contended with the loss of a language exactly what is at stake.

Then of course, they forced my hand today by posting a cornucopia of linkage to all sorts of language and linguistic goodness that will no doubt make it hard for me to get any work done today. As you can see, a lot of our (well, my) favourite topics are covered – X number of words for Y, untranslatable phrases, crowd sourced translations and endangered languages:

Lynneguist spent a month looking at words that don’t really translate very well between American and British English. As an Australian, I’m unsurprised that we share so many commonalities with both – but also amused at how many words from either variety haven’t found their way to Australia. Johnson also investigated the great Atlantic linguistic divide, looking at just how Brits living in the USA have adapted to local pronunciation. Results come in colourful pie charts.

Fritinancy reintroduced us to some tech jargon already lost to history, and some that still survives. Stan Carey gives us an introduction to how Klingon was invented and, while still on something of a sci-fi theme, introduces us to the Spaceage Portal of Sentence Discovery. And while we are looking into the future, the folk at Macmillan reported on the future of dictionaries from the 2011 eLEX conference.

While the internet is having effects on the way dictionaries are used, Piers Kelly at Fully (Sic) also showed us that crowd-sourcing can be great, with a project currently underway to translate ancient Greek texts. You don’t even need to know any Greek to help out. And on the topic of the internet making research more wonderful, the Australian Society for Indigenous Linguistics has made a large segment of its collection publicly accessible – thanks to Jane Simpson at the PARADISEC Endangered Languages and Cultures blog for letting us know the good news.

Some quick links – Language Hat asks about the history of movie pidgins, Arnold Zwicky puzzles over some tricky alphabetising, and that guy over at Dialect Blog talks about guy, as do those guys at Lingua Franca. Ben Zimmer discovers that Kate Bush shows remarkable creativity in her list of 50 words for snow, but as Geoff Pullum, over at Lingua Franca, discovers, not everyone is as well educated when it comes to knowing how many words Eskimos have for snow (clue: it’s not fifty).

It’s now a solid part of my morning reading ritual through my RSS reader. Recommended reading.

The French add new words

Sometimes the existence of the Académie Française makes me think of the people in charge of the Scrabble dictionary. Having said that, at least the Académie is an official state body, unlike our own example – some committee in the employ of the OED, reported as filler in the odd parts of the paper that nothing else fits into.

dev/null writes of a recent “festival of new words”:

Of course, whether the winners make it into the official draft of the French language is another matter; while the Académie may unilaterally coin indigenously French neologisms, getting people to use them is another matter. (The Académie’s word for electronic mail, courriel, seems to have been unsuccessful, with the anglicism “e-mail” instead gaining currency.) Chances are this contest is intended more to promote experimentation with the expressive possibilities of the French language.

I see something like the Académie Française as inherently conservative, but this promotion of experimentation is a positive sign of a living language.


dotSUB emails

dotSUB sends a sporadic email blast that most often ends up in the trash, but the latest was very interesting.

The video of CEO David Orban talking to Forbes magazine has a few interesting points about how subtitling can help an online video go viral:

“The example is that of a video in German—a six-minute piece of investigative journalism about a compensation scandal at the European Parliament—which has been uploaded by a user on dotSUB, and gained a few hundred views per day, even if it has been translated in English”, he recalls. “Until another user translated it in Czech, and it exploded, gaining over 900,000 views in the Czech Republic in three weeks.”

A few weeks later, “somebody else translated it in French, and once again, the video went viral in France, gaining over 600,000 views in two weeks.” Subtitles online can crash through the language barrier, offering more than mere translation. Sometimes it’s just the right turbocharge.

Remember to bring your own grains of salt – dotSUB does good work, but they are still a commercial firm.

Of more interest to me, and to those of you with a linguistic bent, is the top ten list of languages by number of native speakers. It’s not amazingly surprising – Mandarin, Spanish and English are the top three, for instance – but it is interesting nonetheless. It’s ripped straight from Wikipedia’s more comprehensive list of languages by number of native speakers and given the tabloid top-ten treatment.

The wiki page is at least honest enough to acknowledge that there will be some ambiguity when it comes to “what is a language”:

Since the definition of a single language is to some extent arbitrary, some mutually intelligible idioms with separate national standards or self-identification have been listed together, including Hindi-Urdu; Indonesian and Malay; Croatian, Bosnian and Serbian; Punjabi; Tibetan, etc.

Most interesting to me was the ordering from about position #9 – Japanese, in itself a surprise to me. One of those strange surprises where you think “really? That high?” before you read the rest of the list and slowly nod to yourself.

That Javanese comes above languages like French is odd, as is the number of Dravidian languages that rank above European ones. I feel like some language groups are split more finely than others, in a way that gives the impression of unequal treatment. But I’m not an expert; it just seems that way to me.

Chrome Language Detection

Google’s Chrome browser has a built-in function for detecting the language of a website and offering a translation if the site isn’t in your local language (and Google translates between those languages) – roughly 64 languages, iirc.

Known as Compact Language Detection (CLD), it has been extracted from the open-source browser codebase by blogger Mike McCandless and ported into a standalone project on Google Code that can now be integrated into any C++ project, along with some simple Python bindings.
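To give a flavour of how detectors like this work: CLD scores character n-grams of the input against per-language frequency tables. The sketch below is my own illustration of that idea, not CLD’s actual API or tables – the two tiny “profiles” and the `detect` function are invented for demonstration, and real detectors train on far larger corpora.

```python
from collections import Counter

def ngrams(text, n=3):
    # Character trigrams over a padded, lowercased string --
    # the basic feature CLD-style detectors score.
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Tiny illustrative "profiles" -- real detectors build large
# per-language frequency tables from big training corpora.
PROFILES = {
    "en": ngrams("the quick brown fox jumps over the lazy dog"),
    "nl": ngrams("de snelle bruine vos springt over de luie hond"),
}

def detect(text):
    # Pick the language whose profile shares the most n-gram
    # mass with the input text.
    seen = ngrams(text)
    return max(
        PROFILES,
        key=lambda lang: sum(
            min(count, PROFILES[lang][gram]) for gram, count in seen.items()
        ),
    )

print(detect("the dog jumps over"))   # en
print(detect("de hond springt over")) # nl
```

With only two toy profiles this is fragile on short or similar inputs – which is exactly why the real library’s n-gram tables, and the commented-out test cases below, matter.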

It’s also not clear just how many languages it can detect; I see there are 161 “base” languages plus 44 “extended” languages, but then I see many test cases (102 out of 166!) commented out.  This was likely done to reduce the size of the ngram tables; possibly Google could provide the full original set of tables for users wanting to spend more RAM in exchange for detecting the long tail.

Excitingly, since that first post, Mike has written a couple more on this library – one detailing the addition of some Python constants and a new removeWeakMatches method, and another comparing accuracy and performance between CLD and two Java-based projects, the Apache Tika project and the language-detection project:

Some quick analysis:

  • The language-detection library gets the best accuracy, at 99.22%, followed by CLD, at 98.82%, followed by Tika at 97.12%. Net/net these accuracies are very good, especially considering how short some of the tests are!
  • The difficult languages are Danish (confused with Norwegian), Slovene (confused with Croatian) and Dutch (for Tika and language-detection). Tika in particular has trouble with Spanish (confuses it with Galician). These confusions are to be expected: the languages are very similar.

When language-detection was wrong, Tika was also wrong 37% of the time and CLD was also wrong 23% of the time. These numbers are quite low! It tells us that the errors are somewhat orthogonal, i.e. the libraries tend to get different test cases wrong. For example, it’s not the case that they are all always wrong on the short texts.

This means the libraries are using different overall signals to achieve their classification (for example, perhaps they were trained on different training texts). This is encouraging since it means, in theory, one could build a language detection library combining the signals of all of these libraries and achieve better overall accuracy.

You could also make a simple majority-rules voting system across these (and other) libraries. I tried exactly that approach: if any language receives 2 or more votes from the three detectors, select that as the detected language; otherwise, go with language-detection choice. This gives the best accuracy of all: total 99.59% (= 16930 / 17000)!
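That majority-rules scheme is simple to express in code. Here’s a minimal sketch of the voting rule as described – the function name and the placeholder language codes are my own, not from Mike’s benchmark code:

```python
from collections import Counter

def vote(cld, langdetect, tika):
    # Majority vote across the three detectors' predictions; if no
    # language gets at least two votes, defer to language-detection,
    # the most accurate single library in the benchmark above.
    tally = Counter([cld, langdetect, tika])
    lang, count = tally.most_common(1)[0]
    return lang if count >= 2 else langdetect

print(vote("da", "no", "da"))  # two votes: Danish wins
print(vote("da", "no", "sv"))  # three-way split: defer to language-detection
```

Because the libraries’ errors are somewhat orthogonal, a document that trips up one detector is often handled correctly by the other two, which is why this simple rule beats any single library.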

Finally, I also separately tested the run time for each package. Each time is the best of 10 runs through the full corpus:

CLD                   171 msec   16.331 MB/sec
language-detection   2367 msec    1.180 MB/sec
Tika                42219 msec    0.066 MB/sec

CLD is incredibly fast! language-detection is an order of magnitude slower, and Tika is another order of magnitude slower (not sure why).

Artatak: Remapping words

While I was living in Yogyakarta in 2008 I had the pleasure of sharing the space at the now defunct (sads!) Mes56. Katerina Valdivia was also staying there at the time. I will always remember the Argentinian/German New Year’s Eve feast she cooked on my first night there – I was just recovering from dengue fever, having spent Xmas in a fevered stupor – and it was one of the greatest feasts I’d ever tasted.

I got an email from Katerina last week advertising her latest show, titled Remapping Words. Instant attention-grabbing headline in my book:

Sometimes words become the staging of symbolic spaces that attempt to change reality. This is one of the aims of the piece Resignation by Lisha, a work that follows a strategy of redirecting a meaning by altering or adding words. The artist intervenes in the public space, subverting the rules that organise it.

With the work Investir, Valeria Schwarz created a participatory and dialogic piece based on three months of Facebook and online chats with people from North African countries. Taking some of the phrases of their conversations, the artist inserted them into daily-life situations in the city of Murcia, Spain. With this, these sentences acquired another meaning through the new geographical context in which they were presented.

Using subtitles, Stine Eriksen creates in the video Choreography #1 a tension between the word and its display, showing the impossibility of words to fill the absence on which language is based.

I can’t make it (wrong side of the planet), but if you are near Berlin – check it out and let me know!



A Map of Twitter in non-English languages

There’s a fantastic map available that shows which languages Twitter is being used in, and where. I found it via this Big Think post, which has a great breakdown:

What does this map tell us? First of all, like those world-at-night maps, it shows us where all the people are – at least those tweeting. Western Europe is lit up like a christmas tree – with the Netherlands glowing especially bright. Eastern Europe: not so much. Russia is a spider’s web of large cities connected through the darkness of the vast, empty countryside. In East Asia, Japan, South Korea and Indonesia stand out. India is much darker – but maybe that’s because English, no longer majoritary but still dominant, is rendered in subdued grey.

The Middle East is half-lit, but across the arid dunes of Saudi Arabia rather than along the Fertile Crescent. Africa remains Twitter’s darkest continent. The Americas are illuminated in all the usual places: the eastern half and the western coast of the US, with high densities throughout Central America, the Caribbean and the shores of South America, and low densities in its centre.

But the map most of all tells us which language people are twittering in. This is a fascinating way to compare official and actual language use. Quebec, for example, is an enormous French-speaking territory, almost triple the size of France itself. But those Canadians actually practising le tweet in French form a much smaller cluster, huddling around the St Lawrence in a couple of large hubs, with only a few francophone flecks further afield.

The rest of North America is solidly, and massively, anglophone, with only a surprisingly small smattering of Spanish in those areas with large hispanic populations. The US-Mexican border, for all its supposed permeability, is still clearly visible on this map, which shows Spanish dominating most of the rest of the Americas – although Cuba remains as dark as the night.

The fun really begins in Europe, where some countries just vanish off the map: Belgium tweets in Dutch and French, Switzerland mainly in German, with a French bit west of the Röstigraben. And other countries emerge out of nowhere: Catalans tweet in their own language, not Spanish. German dominates Central Europe, but a surprisingly large chunk of Austria appears to be tweeting in Italian – as do a lot of dots inside France.

Those are the really fascinating bits of this map of Twitter’s languages: the ones that show a divergent reality to the one we find on most other maps – even ‘proper’ linguistic ones: is that blue dot south of Amman really a Danish oasis in the Jordanian desert? Does nobody tweet in Lithuanian? And is that Spanish being tweeted in Bermuda?

GlotPress – translating WordPress

I use WordPress quite a lot and was surprised to discover that I knew so little about its internationalisation efforts. I recently stumbled upon GlotPress, the web software built to make simultaneous, crowd-sourced translation of the WordPress codebase easier, and was impressed.

On the site you can see the projects that are being translated, and by clicking through you can start translating the main codebase, versions, themes – all of the base software projects offered by the WordPress community. There’s a getting-started guide that explains exactly how it works, although the site and software are fairly intuitive.

I have my criticisms, though. The GlotPress software itself doesn’t have a good landing or about page – not very friendly to those who want quick information about what it is, how it works and where it can be seen in action. I’m surprised at how few languages are available for some sections. And having the WordCamp theme in the root directory, but the default WordPress theme TwentyEleven inside the WordPress project, was confusing at first, although I’d probably make more sense of it if I delved deeper.

It certainly doesn’t seem as slick or well developed as Transifex, the platform used by Django. Having said that, it obviously works, and it is being developed and improved from what I can see on the GlotPress blog. Coupled with being fairly intuitive to use, I think that matters more than my criticisms. I’m looking forward to seeing how it develops over time.