The things that make me go whoop: Tatoeba

It’s not often you metaphorically whistle under your breath at something cool, but that’s just what I did when I learnt about Tatoeba, a site dedicated to an open, collaborative translation of sentences rather than words.

“Why sentences? You may ask. Well, because sentences are more interesting. Sentences bring context to the words.”

Even more fascinating is how they set what they are doing against what everyone else is doing, in particular the failing of platforms and software built on language pairing: some languages get left behind, and some language pairs are never consummated (Icelandic/Swahili, anyone?). The openness and transparency about their motivations and methods are refreshing compared to Facebook and Google. The description of how they are overcoming the language pairing issue harks back to the mathematical principle of transitive relations (i.e. whenever A = B and B = C, then it’s also true that A = C): all languages are interconnected – “awesome, right?”. Absolutely.
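Being a computer scientist, I couldn't resist sketching the transitive idea in a few lines of Python. This is only my own illustration, not Tatoeba's actual code: the sample sentences, the language codes and the function names (build_graph, reachable_translations) are all made up for the example. It treats each sentence as a node and each direct translation as a link, then follows the links to find every sentence reachable from a starting one.

```python
from collections import defaultdict, deque

# Hypothetical sample data: each sentence is a (language code, text) pair,
# and each tuple below is one direct, human-made translation link.
links = [
    (("isl", "Góðan daginn."), ("eng", "Good morning.")),
    (("eng", "Good morning."), ("swh", "Habari ya asubuhi.")),
    (("eng", "Good morning."), ("fra", "Bonjour.")),
]


def build_graph(pairs):
    """Build an undirected graph: a translation link works in both directions."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def reachable_translations(graph, start):
    """Return every sentence connected to `start`, directly or through others."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)  # a sentence is not a translation of itself
    return seen


graph = build_graph(links)
icelandic = ("isl", "Góðan daginn.")
indirect = reachable_translations(graph, icelandic)

for language, text in sorted(indirect):
    print(language, text)
```

Nobody translated the Icelandic sentence into Swahili here, yet the pair turns up because both are linked to the same English sentence. In practice translation is only loosely transitive (meaning drifts a little at every hop), so a real tool would probably want to limit how many hops it follows.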

The video below is a great introduction, and with Universal Subtitles, it’s available in 8 languages (Italian, the ninth, is only 27% translated). This makes perfect sense – the very idea underlying Tatoeba completes the loop that these two projects make, and proves the old adage that the whole is greater than the sum of its parts.

They provide a range of tools for the glyph-based Chinese and Japanese languages; a visualisation of “what’s going on now”; and the sentences can be downloaded, although there is a warning:

The data you will find here will NOT be useful unless you are coding a language tool or doing some work on data processing.

If you want data that you can use as a humble language learner, you can check out the lists section where you can build your own lists of sentences or view others’ lists and print them.

You can make your own lists, each of which can be downloaded! This is simply amazing – my mind is boggling at the possibilities (as a computer scientist). If anything tells me that this will be a must-watch platform/site, it’s that the famous meme In Soviet Russia… is already getting translated.

For machine translation freaks, Tatoeba will be producing one of the best corpora available – and it will be CC licensed. In terms of what else is available, this is very similar to the offerings of Chrome/Google+ and Facebook, but those platforms aren’t open, and aren’t giving the translations back to the community that created them for that community to build upon. The functions are only there to be monetised, for that community to become a commodity. Which makes Tatoeba all the more interesting:

It’s part of an ecosystem that we want to build. We want to bring language tools to the next level. We want to see innovation in the language learning landscape. And this “cannot” happen without open language resources which can not be built without a community, which can not contribute without efficient platforms. So ultimately, with Tatoeba, we are only building the foundations…to make the Web a better place for language learning.