Google Translate reaching further

Somehow I missed it at the end of last year, but Google Translate has added nine new languages – five from the African continent, three from Asia, and Maori.

In Africa, we’re adding Somali, Zulu, and the 3 major languages of Nigeria.
  • Hausa (Harshen Hausa), spoken in Nigeria and neighboring countries with 35 million native speakers
  • Igbo (Asụsụ Igbo) spoken in Nigeria with 25 million native speakers
  • Yoruba (èdè Yorùbá) spoken in Nigeria and neighboring countries with 28 million native speakers
  • Somali (Af-Soomaali) spoken in Somalia and other countries around the Horn of Africa with 17 million native speakers
  • Zulu (isiZulu) spoken in South Africa and other south-western African countries with 10 million native speakers
Throughout Asia, we’re launching languages spoken in Mongolia and South Asia.
  • Mongolian (Монгол хэл), official language in Mongolia and also spoken in parts of China with 6 million native speakers
  • Nepali (नेपाली), spoken in Nepal and India with 17 million native speakers
  • Punjabi language (ਪੰਜਾਬੀ) (Gurmukhi script), spoken in India and Pakistan with 100 million native speakers
Thanks to the volunteer effort of passionate native speakers in New Zealand, we’re adding the language of the Maori people.
  • Maori (Te Reo Māori), spoken in New Zealand with 160 thousand speakers

Unbabel – Translation as a Service

How quickly things change. It’s been a while since I’ve had a chance to look at the state of translation and translation tech, and now it seems that all the latest trends have come together.

Unbabel comes with the brash young entrepreneur, and that youth in turn brings something akin to ignoratio elenchi – the byline is “Translation as a Service”

Human corrected machine translation service that enables businesses to communicate globally

dutifully adhering to the modern “X as a Service” line so necessary for venture capital funding without understanding the nature of translation (it’s always been a service), and as happens with this style of disruptive tech, poorly paid contractors making management rich.

Despite my reservations about the motivations of Unbabel’s direction and management, and my knowledge of what this will do to the translation industry, this is not unexpected. I’ve written many times before about the coming changes and the shake-up the industry should by now be expecting. I would suggest that this is the final ramping up of this process; the next step will be the collapse of the industry. This will lead to two distinct results – a massive increase in the number of translated texts, and a dramatic shrinkage of the employment prospects but an increase in the financial returns for those translators that stick at it long enough.

TechCrunch manages to say a lot

Unbabel’s secret sauce leverages artificial intelligence software and its stable of over 3,100 editors (or translators) to translate a website’s content from one language into its customer’s language of choice. First, its machine learning technology translates the text from source into the target language, at which point it uses its Mechanical Turk-style distribution system to assign editing tasks to the right translators, who then check the translation for errors and for stylistic inconsistencies.

Unbabel editors work remotely, via their laptops or mobile phones, on translations, which co-founder Vasco Pedro says provides the key to faster translations. This, combined with the efficiency of its task distribution and administration algorithms, provides a level of efficiency that allows editors to earn up to $10/hour working for Unbabel.

but without much analysis – the technology sector and its loyal heralds have never been good at analysis that didn’t revolve around profit and where it’s coming from

Human translation is really the gold standard as far as online translation goes, but for most companies, paying real, live humans to translate their content is an expensive proposition. In most cases, it’s either pony up the funds to pay for humans, or make due with machines (like publicly available tools akin to the unreliable Google Translate) and automated services. By combining both machine translation and human curation, the Unbabel founders not only believe they’ve created a novel solution to a persistent problem, but that they can offer a product that’s on par with pure human translation, faster, and at a fraction of the cost.

Note here that the only mentions of cost are an “expensive proposition” and a “fraction of the cost”. This was to be expected, and I lectured the translation industry that they should expect it. I did not expect the young turks to dismiss the expensive past without even an acknowledgement of the history, theory or purveyors of that industry. I guess that’s why they call them the blues.

Mostly about language

I don’t like blogging like this, but it’s hard to find the time with an intermittent Internet. I find titbits, but I rarely follow links – I’ve not watched an online video in almost a year and my inbox has an email thread containing 276 emails with over 400 links to “revisit” once I return to the land of faster bandwidth. As though anyone on the Internet has time for 400+ old links.

However, as someone who is interested in language, it behooves me to relay this content that I’ve found.

I don’t know why I have a low opinion of Will Self, but I do. As a self-important anarchist I think that I rub up against other self-important *ists. Despite this I found his latest piece for the BBC, In defence of obscure words, a rollicking good skewering of the stupid, the vapid, the empty. Be it expressing a love of words and language and using them:

I’d point out that my texts were as full of resolutely Anglo-Saxon slang as they were the flowery and the Latinate. I’d observe that English, being a mishmash of several different languages, had a large and exciting vocabulary, and that it seemed a shame not to use it – especially given that it went on growing all the time, spawning argot and specialist terminology as freely as an oyster does its milt.

or the end result of a culture built by the risk averse:

But now that all formerly difficult subject matter is, if not exactly permitted, readily accessible, cultural artificers have no need to aim high. The displacement of aesthetically and intellectually difficult art as the zenith has resulted in all sorts of sad and interrelated phenomena.

In the literary world, books intended for child readers are repackaged and sold to kidult ones, while even notionally highbrow arbiters – such as Booker judges – are obsessed by that nauseous confection “a jolly good read”. That Shakespeare remains our national writer is, frankly, bizarre, given that with his recondite vocabulary, myriad historical references, and convoluted metaphorical language, were he to be seeking publication in the current milieu, his sonnets and plays would undoubtedly also be branded as ‘too difficult’.

As for visual arts, the current Damien Hirst retrospective at Tate Modern is a perfect opportunity to see what becomes of an artificer whose impulse towards difficult subject matter was unsupported by any capacity for hard cogitation or challenging artistry. The early works – the stuffed animals and fly-bedizened carcasses – retain a certain – albeit recherché – shock value, while the subsequent ones degenerate steadily to the condition of knocked-off merchandise, making the barrier between the gift shop and the exhibition space evaporate in a puff of consumerism.

But the most disturbing result of this retreat from the difficult is to be found in arts and humanities education, where the traditional set texts are now chopped up into boneless nuggets of McKnowledge, and students are encouraged to do their research – such as it is – on the web.

I quite enjoyed the brief moment of intellectual challenge that he poses.

Which is why I now turn to a phenomenon that really only exists because of the Internet but grew from the old style newsprint tropes “Word of the day”, maybe combined with “What in the world” – the longer form list of obscure, obtuse, unused, hard to translate or extinct words. Usually in groups of five, eight or ten. I’m not immune to posting links to these lists here on Pineapple Donut, but it’s not often that it’s done anew – as an infographic and without the pronunciation of the words. And to stick it up to Mr Self, I found it through the most internet of ways – in RSS from a tumblr called this isn’t happiness, via mentalfloss, and then PopSci, to the original artist’s site, 21 Emotions with No English Word Equivalents.

At first I was put off by the filter of emotive words, but I came around as I thought about it – not only was Pei-Ying’s choice considered in that it provided a focus that’s easy to explain, empathise with and understand, but it gave her the opportunity to explore feelings that don’t have words in English, or any other language presumably, but are unique and identifiable to the (ahem, current) internet age. Unfortunately the artist’s site was so popular after the various postings that their broadband limit has been blown, or 509’d in tech speak.

I didn’t know that the Talkley Award even existed, but the Crikey language blog, Fully (sic), noted that last year it was given to Ingrid Piller. Awarded to an individual who has done the most to increase public knowledge about language, she sounds like the person we would most like to be sitting next to on the 6am flight from Nadi to Tarawa.

Cory Doctorow fires up more passion in people than I’d expect – I find him interesting, intelligent and sometimes even enthralling, but the argy bargy that follows him is hard for me to comprehend. He writes for the Guardian on the difference between value and price in the internet era, largely focusing on positive externalities and their exploitation. Most interesting to me is his use of Google and its approach to translating.

A positive externality arises when you do something you want to do that also makes life better for someone else. For example, if you drive your car slowly and carefully to avoid a wreck, a positive externality is that other users of the road have a safer time of it, too. If you keep up your front garden because it pleases you, your neighbours get the positive externality of slightly buoyed-up property values from living on a nicely kept street.

Positive externalities — virtuous cycles — are all around us. Your kid learns to speak because of all the people around her who carry on conversations and because of the TV shows and radio programmes where speaking occurs (as do immigrants like my grandmother, whose English fluency owes much to daytime TV after she came to Canada from Russia).

Google is a case-study in harvesting positive externalities. It offered a free, voice-based directory assistance number, and used the interactions users had with its software to build a corpus of common phrases, expressed in multiple accents and under a wide range of field conditions. Then it used this to train the voice-recognition software that powers its Android-based phone-search. Likewise, it mined all the publicly available translations on the web – EU documents that appeared in multiple languages, fan-based translations for subtitles on cult cartoons, and everything else it could find – and used this to train its automated translation engine, providing it with the context that it needed to figure out the nuance and sense of ambiguous phrases.

He contends that the defining mania of the internet era is

resentment over positive externalities. Many people and companies have concluded that if someone, somewhere, is getting value from their labour, that they should get a cut of that value… Many people have accused Google of “ripping off” the public by indexing content, or analysing it, or both. Jaron Lanier recently accused Google of misappropriating translators’ labour by using online translated documents as a training set for its machine-translation engine – an extreme version of many labour-oriented critiques of online business.

leading to

the infectious idea of internalising externalities turns its victims into grasping, would-be rentiers. You translate a document because you need it in two languages. I come along and use those translations to teach a computer something about context. You tell me I owe you a slice of all the revenue my software generates. That’s just crazy. It’s like saying that someone who figures out how to recycle the rubbish you set out at the kerb should give you a piece of their earnings. Harvesting positive externalities involves collecting billions of minute shreds of residual value – snippets of discarded string –and balling them up into something big and useful.

While I enjoy his take, either he or Lanier has missed the mark. If Lanier’s critique was purely about the Google Translation Toolkit it would be understandable, but as is pointed out in the comments – the EU have made the translations available for exactly that purpose. Similarly, all the Free and Open Source software translation files have been there in the public domain waiting to be harvested since the movement started in the early 1990s – it was just a matter of someone thinking to harvest the files, and having the hardware and technical expertise to do so. And indeed, those files remain open source – someone else is welcome to harvest the same files. Google hasn’t locked them up. The Translation service, on the other hand, asking for translators’ Translation Memories and storing them – that is taking other people’s work. I guess the question then becomes: can Google guarantee that they haven’t used those TMs in their translation service?

Finally, for the real language nerds, Matt Might’s The language of languages is a healthy, if slight, refresher on context-free grammars:

Languages form the terrain of computing.

Programming languages, protocol specifications, query languages, file formats, pattern languages, memory layouts, formal languages, config files, mark-up languages, formatting languages and meta-languages shape the way we compute.

So, what shapes languages?

Grammars do.

Grammars are the language of languages.

Behind every language, there is a grammar that determines its structure.

This article explains grammars and common notations for grammars, such as Backus-Naur Form (BNF), Extended Backus-Naur Form (EBNF) and regular extensions to BNF.

The discussion of context-sensitive grammars and parsing is poorly explained to my mind, and the article in general could be more interesting to the non computer scientist with a little more work. A primer only, really.
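For anyone who would rather see a grammar at work than read about one, here is a minimal sketch in Python. The grammar is my own toy example in EBNF, not one taken from Might's article, and the names tokenize, Parser and evaluate are hypothetical helpers invented for illustration. Each grammar rule becomes one method of a recursive-descent parser, which is about the simplest way I know to show how a grammar determines structure.

```python
import re

# Toy grammar in EBNF (an assumption for illustration, not from the article):
#   expr   ::= term   { ("+" | "-") term }
#   term   ::= factor { ("*" | "/") factor }
#   factor ::= NUMBER | "(" expr ")"

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Split the input into number, operator and parenthesis tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character near position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    """Recursive-descent parser: one method per grammar rule."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        if token is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return token

    def expr(self):
        # expr ::= term { ("+" | "-") term }
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.take() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        # term ::= factor { ("*" | "/") factor }
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.take() == "*":
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):
        # factor ::= NUMBER | "(" expr ")"
        token = self.take()
        if token == "(":
            value = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def evaluate(text):
    """Parse and evaluate an arithmetic expression under the toy grammar."""
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value
```

Calling evaluate("2 + 3 * 4") multiplies before it adds, not because the code knows anything about arithmetic precedence but because the grammar nests term inside expr. Change the grammar and the structure of the language changes with it.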

One step at a time: Google plays the longest game around

Slashdot has informed me of a HuffPo piece by Found in Translation co-author Nataly Kelly about Google hiring Ray Kurzweil, potentially the world’s most eccentric dork.

The beauty of the web shines through when a commentator can sum it up and extrapolate better than the original post:

You need to investigate the entire initiative Google is spearheading around its acquisition of Metaweb. They are building an ontology for human knowledge, and are ultimately building the semantic networks necessary for creating an inference system capable of human level contextual communication. The old story about the sad state of computers’ contextual capacity, recounts the story of the computer that translates the phrase “The spirit is willing, but the flesh is weak.” from English to Russian and back and what they got was “The wine is good but the meat is rotten.”

The new system won’t have this problem. Because it will instantly know about the reference coming from the Bible. I will also know all the literary links to the phrase, the importance of its use in critical historical conversations, The work of the Saints, the despair of martyrs, in short an entire universe of context will spill out about the phrase and as it takes the conversational lead provided by the enquirer it will dance to deliver the most concise and cogent responses possible. In the same way, It will be able to apprehend the relationship between a core communication given in context ‘A’ and translate that conversation to context ‘B’ in a meaningful way.

Ray is a genius for boiling complex problems down into tractable solution sets. Combine Ray’s genius with the semantic toy shop that Google has assembled, and the informational framework for an autonomous intellect will become. The real question is how you make something like that self aware. There’s a another famous story about Helen Keller, before she had language. symbolic reference, she lived like an animal. Literally a bundle of emotions and instincts. One moment, one utterly earth shattering moment there was nothing, then Annie Sullivan her teacher placed her hand in a stream of cold water and signed water in her palm. Ellen understood… water. In the next moment Ellen was born as a distinct and conscious being, she learned that she had a name, that she was. I don’t know what that moment will look like for machines, I just know its coming sooner than we think. I also can’t be certain whether it will be humanities greatest achievement or our worst mistake. That awaits seeing.

Google Translates everything

Is Google Translate heading towards an end game position? Two posts on the products blog would have me believe it is closer than you would think. Just this week Google announced that email translation was moving from a Labs curio to all email users:

We heard immediately from Google Apps for Business users that this was a killer feature for working with local teams across the world. Some people just wanted to easily read newsletters from abroad. Another person wrote in telling us how he set up his mom’s Gmail to translate everything into her native language, thus saving countless explanatory phone calls (he thanked us profusely).

Since message translation was one of the most popular labs, we decided it was time to graduate from Gmail Labs and move into the real world. Over the next few days, everyone who uses Gmail will be getting the convenience of translation added to their email. The next time you receive a message in a language other than your own, just click on Translate message in the header at the top of the message…

If you’re bi-lingual and don’t need translation for that language, just click on Turn off for: [language]. Or if you’d like to automatically have messages in that language translated into your language, click Always Translate. If you accidentally turned off the message translation feature for a particular language, or don’t see the Translate message header on a message, click on the down arrow next to Reply at the top-right of the message pane and select the Translate message option in the drop-down.

The second big hint is the article Breaking down the language barrier—six years in, by one of the Google Translate researchers. It includes a short history and context, and the stats are amazing:

Today we have more than 200 million monthly active users on translate.google.com (and even more in other places where you can use Translate, such as Chrome, mobile apps, YouTube, etc.). People also seem eager to access Google Translate on the go (the language barrier is never more acute than when you’re traveling)—we’ve seen our mobile traffic more than quadruple year over year. And our users are truly global: more than 92 percent of our traffic comes from outside the United States.

In a given day we translate roughly as much text as you’d find in 1 million books. To put it another way: what all the professional human translators in the world produce in a year, our system translates in roughly a single day. By this estimate, most of the translation on the planet is now done by Google Translate.

Of course, he repeats the mantra that all professional translators will want to hear:

Of course, for nuanced or mission-critical translations, nothing beats a human translator—and we believe that as machine translation encourages people to speak their own languages more and carry on more global conversations, translation experts will be more crucial than ever.

I think that these two posts are important – not only does Google have enough faith in its translations that it can roll them out across potentially the most used email system on the planet, but the statistic of 1 million books a day being translated just goes to show how much off the cuff, non mission critical translation was just waiting to happen.

First post from Kiribati

We have arrived in Kiribati! It’s lovely – the weather has been rough and ready, but hot and wet. The people are lovely and the scenery is quite amazing. I pinch myself every day. The internet connection on the other hand is appalling. And when I say internet connection I mean Internet Connection – there are a few bottlenecks, but the most frustrating is that of the national telecom monopoly – their uplink, the main one on the island, is appalling. This blog post is being constructed in a text editor offline on the weekend from tabs I didn’t close on Friday afternoon and it feels quite unnatural. Anyway, more on the Kiribati language is coming; in the meantime I thought I’d mention two articles I noticed during the week.

The first is from Fully (sic), Crikey’s language blog, about the localisation of comics in the daily papers here in Australia. In focus is the localisation of Zits, with its use of mom being changed to mum. The bulk of the article ruminates on the limited use of localisation from American (or British) into Australian – we have internalised their spellings and language usage over the last 50 years by importing their culture:

The Zits case is different though. We’re quite used to our locally produced content (or British content, for that matter) being edited for US audiences. But changing mom for mum in the Zits cartoon goes the other way. And this is something we’re not used to. We in Australia are effectively bidialectal – we hear US English (and likely other dialects too) very frequently and can effortlessly translate phrases, lexical items and spellings without it even breaching our conscious mind. For this, I suppose we can thank fifty years or more of pervasive US culture dominating our media. Perhaps this is the reason that such substitutions irritate Alan – just like everyone else, he knows that Americans spell it mom, and has no problem understanding it, but critically he also knows that Zits is an American comic strip – the characters’ voices in his head would most probably have American accents. So when he reads mum where he expects mom, it’s clearly going to be quite jarring.

The second article is from the dependable dev/null. A German company have started “creating” t-shirts – or more accurately t-shirt slogans, in both English and German:

Some of the results are more presentable than others; one might believe that “Budapest Bicycle Flux” was a semi-obscure math-rock band whose gig the wearer happened to catch in some college-town bar back in the day, and there are situations where one might plausibly wear a T-shirt reading “I Reject Your Reality And Replace It With Cupcakes”, which, alas, cannot be said for some of the outputs, such as “your vagina is a wonderland”, or a grid of words including “Hitlerponys”, “Mörderpenis” and/or the decidedly euphemistic-sounding “wurstvuvuzela”. … Interestingly enough, after clicking through the site for a while, a reader with a limited grasp of German may find their German comprehension improving slightly; perhaps the flood of meaningful (if nonsequiturial) sentences exercises the language pattern-matching parts of the brain in some kind of process of combinatorial fuzzing, reinforcing plausible word sequences.

Google Translate now does Esperanto

The Google Translate blog has announced that they have added Esperanto to the list of available languages:

Esperanto and Google Translate share the goal of helping people understand each other, this connection has been made even in this blog post. Therefore, we are very excited that we can now offer translation for this language as well.

The Google Translate team was actually surprised about the high quality of machine translation for Esperanto. As we know from many experiments, more training data (which in our case means more existing translations) tends to yield better translations. For Esperanto, the number of existing translations is comparatively small. German or Spanish, for example, have more than 100 times the data; other languages on which we focus our research efforts have similar amounts of data as Esperanto but don’t achieve comparable quality yet. Esperanto was constructed such that it is easy to learn for humans, and this seems to help automatic translation as well.

Google Translate: the written word

Over at Google, the New Year’s present for 2012 is titled Sometimes it’s just easier to write. It is an update to the Google Translate app for Android in which one can enter handwritten characters via the touch screen:

Our goal is to break down the language barrier, all the time, everywhere. By adding handwriting input directly into our Android app we hope to help you get translation done even more quickly and easily. Sometimes you don’t know how to say what you want translated, sometimes you can’t type it, and sometimes it’s easier just to write it. We think of handwriting on the touchscreen as another natural input…

This is still an experimental feature. It’s available in Chinese and Japanese, and you can enable it for English, French, Italian, German, and Spanish if you like. (We currently only support single-character input for Chinese and Japanese.) Just as with speech recognition and our translations themselves, our handwriting recognition happens in the cloud, allowing us to continually improve accuracy without requiring you to download new versions of the app.