Only on Reddit

Only on Reddit do I discover things like this:

TIL that a popular slang in Danish for having your period is “Der er kommunister i lysthuset”, or, “There are communists in the funhouse”

When I called shenanigans – believing this to be a puerile internet prank, and maybe a little upset at the implication that communists or menstruation spoil fun times – I got confirmation from others, including more euphemisms:

I’m danish and I’ve heard it before 🙂

We also have “The red painter is visiting” and a bunch of others… The worst term for it is probably “period-sauce”…

Iceland wants to be your friend

I’ve found a lovely little country from the northern hemisphere, Iceland, and it’s very friendly. It has a number of social outlets including a tumblr which includes pictures of it hanging out with semi-famous people, and the now tumblr-obligatory f*ckyeahX (f*ckyeahIceland of course). The most fascinating for me was the link to Every Single Word in Icelandic: it’s very friendly (posts are often signed off “Your friend, Iceland”), self-referential, and humorous to boot. As an Australian, I appreciate a nation that doesn’t take itself too seriously:

Sex.

This word from my people’s language makes many humans very confused.

For example, if you work in a restaurant, and one day the tele-phone rings, and a person asks you if you have “a table for sex”, you will probably not know what to say.

(Unless you live in a famous city called New York. I have heard that humans who live there always know what to say.)

If this happens to you, please do not hang up and call the pó-lice.

It is probably just one of my people, who wants to have good food in your restaurant with five of his or her friends.

Because in my people’s language sex means “six”.

Your friend,
– Iceland

Óvegur.

Humans often use the words in their languages to describe things. The word “vegur” is one of those words. (It means “road”.)

But sometimes they use words to describe things that are not really things.

“Ó-vegur” is one of those words. It means “un-road.”

Jæja.

This is a word that my people often use.

Nobody knows exactly what it means, but they think it is very useful for ending meetings.

You can just say “jæja”, and stand up.

And then, all the humans will know that their meeting is over.

(If you are very clever, and think you have a good translation, you can tell me.)

Tíst.

Until a very short time ago, only my birds and mice knew how to make tísts.

Now many of my people make them, too, on their Inter-net.

It means “tweet”.

(My people like to make old words in their language do new things like that. I will show you more words like that later.)

A nice man named Sveinbjörn was the first of my people to tíst on the Inter-net. It is not much, but you can see it here.

And if you think my people’s language has good words in it, you can follow it on the Twitter, here.

Remember people: words and meaning are important, and, in case you were wondering, I’ll finish with the word Iceland written in a number of languages:

iceland, 아이슬란드, island, 冰島, islanda, アイスランド, islande, Ισλανδία, islanti, ایسلند, islandija, ประเทศไอซ์แลนด์, izland, ÍSLAND, איסלנד, islàndia, أيسلندا, islandia, आइसलैंड, ijsland, Исланд, ysland

 

Humanising Translation Technology

Recently The Independent had an article about Google Translate which turned out to be an extract from a book by David Bellos. I certainly don’t agree with everything he says, and it is a bit waffly, but I do appreciate the direction he takes the piece.

It is not based on the intellectual presuppositions of early machine translation efforts – it isn’t an algorithm designed only to extract the meaning of an expression from its syntax and vocabulary.

In fact, at bottom, it doesn’t deal with meaning at all. Instead of taking a linguistic expression as something that requires decoding, Google Translate (GT) takes it as something that has probably been said before.

As I’ve noted before, I think this is incorrect – but it would be a simple error to make for the non-tech-savvy or those without access to the source (as Levy did in my previous post). The reality is that search and translation are peas in a pod – both processes are looking for meaning. While I understand and accept that David is probably very close to the truth, Google themselves would be foolish if they weren’t letting their research in these two fields inform each other.

It uses vast computing power to scour the internet in the blink of an eye, looking for the expression in some text that exists alongside its paired translation.

The corpus it can scan includes all the paper put out since 1957 by the EU in two dozen languages, everything the UN and its agencies have ever done in writing in six official languages, and huge amounts of other material, from the records of international tribunals to company reports and all the articles and books in bilingual form that have been put up on the web by individuals, libraries, booksellers, authors and academic departments.

Drawing on the already established patterns of matches between these millions of paired documents, Google Translate uses statistical methods to pick out the most probable acceptable version of what’s been submitted to it.

Much of the time, it works. It’s quite stunning. And it is largely responsible for the new mood of optimism about the prospects for “fully automated high-quality machine translation”.

Google Translate could not work without a very large pre-existing corpus of translations. It is built upon the millions of hours of labour of human translators who produced the texts that GT scours.

Google’s own promotional video doesn’t dwell on this at all. At present it offers two-way translation between 58 languages, that is 3,306 separate translation services, more than have ever existed in all human history to date.
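As an aside, the 3,306 figure in the extract checks out if each direction of each language pair is counted as its own service. Here is a minimal sketch of that arithmetic; the function name is mine and purely illustrative, not anything from Google:

```python
def directed_pairs(n_languages: int) -> int:
    """Count ordered (source, target) pairs where source != target.

    Each language can be translated into every other language,
    and A -> B is counted separately from B -> A.
    """
    return n_languages * (n_languages - 1)


# 58 languages, each translatable into the other 57
print(directed_pairs(58))
```

If a pair were counted only once regardless of direction, the number would be half of this.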

Here he makes an interesting point – and one that I’ve been pushing since I started this blog – that translators should be recognised for their contributions, as coders are in the FLOSS ecosystem. When I think on it further though, I wonder if it matters – does the family of the now-passed translator from early last century care that Google has made all our lives better without attribution? Do the makers of the innumerable stone axe heads deserve attribution for their work in fine-tuning a useful tool? Will the 23rd century users of C-3PO-like robots or BabelFish care, and even if they did – would it matter to me or David?

GT is also a splendidly cheeky response to one of the great myths of modern language studies. It was claimed, and for decades it was barely disputed, that what was so special about a natural language was that its underlying structure allowed an infinite number of different sentences to be generated by a finite set of words and rules.

A few wits pointed out that this was no different from a British motor car plant, capable of producing an infinite number of vehicles each one of which had something different wrong with it – but the objection didn’t make much impact outside Oxford.

GT deals with translation on the basis not that every sentence is different, but that anything submitted to it has probably been said before. Whatever a language may be in principle, in practice it is used most commonly to say the same things over and over again. There is a good reason for that. In the great basement that is the foundation of all human activities, including language behaviour, we find not anything as abstract as “pure meaning”, but common human needs and desires.

All languages serve those same needs, and serve them equally well. If we do say the same things over and over again, it is because we encounter the same needs, feel the same fears, desires and sensations at every turn. The skills of translators and the basic design of GT are, in their different ways, parallel reflections of our common humanity.

And this is where I enjoyed this piece – apart from the always welcome English humour – the return to humanism, the bringing of all this technological talk to the poetic, the beautiful. Technology is a reflection of our humanity – as well as an amplifier of our desires and expander of our horizons. And this is the great unspoken promise of a functional GT that is available to all for free.

OmegaT 2.5.0 released

Didier has announced the release of OmegaT 2.5.0 – get downloading!

The most important enhancement is the support of multiple translations for a given source segment. Auto-propagation still works as usual, but it is now possible to create alternatives to “default” (auto-propagated) translations. It is also possible to deactivate completely auto-propagation.

From the user interface point of view, several new panes are available in OmegaT (if needed, use Restore Main Window to make them appear). There is a Multiple Translation pane, a Notes pane, where it is possible to enter notes for each segment, and a Comments pane, where non-translatable text can be extracted by the filters to give context to the translator. Currently, the PO, HTML and Java properties filters have been updated to use this feature.

Tech community anger at crowd sourced translations

Steam, the internet’s most popular game distributor, is crowd sourcing its game translations. This has caused anger in the tech community:

Steam/Valve has decided to build a “community effort” to get its Steam platform and game files translated by the community into 26 languages (english, czech, danish, dutch, finnish, french, german, hungarian, italian, japanese, korean, norwegian, “pirate”, polish, portugese, romanian, russian, spanish, swedish, simplified and traditional chinese, thai, brazilian, bulgarian, greek & turkish).

but here is the catch:

Translators do not get paid. They do enjoy many perks however, like access to the game text to be translated (not the game itself, god forbid they could actually test their translation within the game and not have to pay for it), and… and… that’s about it.

Update: I did some math; the test text when you sign up for Steam Translation Server is 265 words; at the current rate of 0.09 USD per word this means 23.85 USD is how much a professional translator would charge to translate that text. Now if only storefront descriptions like this are to be translated for all games (using Steam’s claim of a catalogue of over 1100 games and growing) that would mean that Steam is saving roughly 26235 USD per language (and keep in mind thats only for short storefront descriptions of games).

Now there are 26 languages on the Translation Server at present; that means roughly 26235 x 26 = 682110 USD are being saved by Steam making the “community” work for free.

To that you have to add the costs for reviewing said translations; 0.03 USD per word, so easily enough 682110 divided by 3 = 227370 USD. (that is assuming only one version of the text has to be reviewed, which is not the case)

So, Steam has just saved 909480 USD by making the “community” work for free.
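The figures in that update can be reproduced step by step. The sketch below uses only the numbers quoted above (265 words, 0.09 USD per word for translation, 0.03 USD per word for review, 1100 games, 26 languages); the function and key names are my own, and the rates are the commenter’s assumptions rather than anything I have verified:

```python
from decimal import Decimal


def steam_savings(words=265,
                  translate_rate=Decimal("0.09"),
                  review_rate=Decimal("0.03"),
                  games=1100,
                  languages=26):
    """Reproduce the quoted back-of-envelope estimate, in USD.

    Decimal is used so that the per-word rates stay exact.
    """
    per_description = words * translate_rate       # one storefront text
    per_language = per_description * games         # whole catalogue, one language
    translation = per_language * languages         # whole catalogue, all languages
    review = words * review_rate * games * languages
    return {
        "per_description": per_description,
        "per_language": per_language,
        "translation": translation,
        "review": review,
        "total": translation + review,
    }


for name, amount in steam_savings().items():
    print(f"{name}: {amount} USD")
```

The arithmetic in the quote holds up; whether the per-word rates and the one-description-per-game assumption are realistic is a separate question.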

I would love to hear from people who know more about translation costs in America regarding the pricing that has been listed. I think the main source of anger is directly related to Steam’s large profits and that not even a free game is offered in compensation – especially when digital game distribution has a cost of almost zero – i.e., it would cost Steam nothing to provide a gratuity.

There are a number of issues that spring to mind – how does one become an accredited translator into pirate for instance? This is an example of a translation effort that can almost only happen by means of crowd sourcing since the language was created on and by the internet via crowd sourcing – starting with Talk like a Pirate Day (Wikipedia entry) and then somewhat legitimised by Facebook.

Then there is the obvious problem for Steam (apart from the million dollar translation costs if done “legitimately”) of to whom to give a gratuity – would a crowd member have to submit a certain number of strings to qualify? Would it be based on votes garnered for the strings submitted, or strings accepted for the official or final translation? There’s also a time factor – games age quickly and translations take time. Crowd sourcing does a fantastic job of parallelising translation production – I would suggest that this process will be complete for Steam within the year, if not sooner – probably a saving of at least 6-12 months.

Further, without copies of the games, surely the contextual information needed to do a correct or proper translation would be missing?

Thankfully the more thoughtful crowd at Slashdot have weighed in, making the obvious point that it’s hypocritical to promote open source software (created via crowd sourcing) but denigrate translation using the same methods.

Another commentator brings subtitulos.es to attention – a crowd sourced Spanish (European, I presume) subtitle project, and a third throws in the obligatory “hovercraft full of eels” line (context).

While I understand that those translating should be afforded some recognition, given that it’s to the community’s benefit I don’t have a problem with Steam’s actions.

Translating Jokes

Translating humour is hard. There’s no other way to put it. I’m sure there are volumes of academic writing on the subject. I’m not talking about the humour that comes from (mis)translations – I’m talking about translating jokes.

It all started when a tweet alerted me to a joke going around the Chinese interwebs titled A village with only one restaurant. It took me a while to see the joke, the humour and finally the deeper revelations about safe communication between users in an aggressively censorious atmosphere – euphemism and humour become primary in the criticism of the powers that be.

Villager: Why can’t we have more than one restaurant?
Waiter: Our village is in a stage of development where more than one restaurant can lead to chaos, so we only have one restaurant.

Villager: But the food here is really not good!
Waiter: Our restaurant has only been developing for a short time. Even if the food tasted worse than this, at least it’s our own food!

Villager: But can’t it be a little cheaper?
Waiter: That would not suit the conditions of our village; the restaurant also needs to develop.

Villager: But the employees of the restaurant are all driving Mercedes Benz cars!
Waiter: To ensure fair and uncorrupt staff, you need to pay them high salaries.

Villager: But last year, you lent all the profits of the restaurant to another village.
Waiter: This is the village policy, you don’t need to worry about it.

There’s added humour in going to the original site and getting that page translated by Google Translate (“Small two: one that their village is nothing good, one that other villages on what is good. Village to sell you a thief! ! !”), but again, this is merely schadenfreude at Google Translate’s expense. Of course, when you actually do want humour from GT, that’s not necessarily what you get.

About 30 seconds of searching, though, turned up a list of links on the front page of my Google search that was amazingly informative. There’s a couple of posts about humour and the recent Arab Spring. How to translate a joke notes:

Is it possible to translate a joke? Of course, but it can be difficult because jokes often depend on “inside knowledge” that has to be explained to outsiders.  As the saying goes, “if you have to explain a joke, it isn’t funny anymore.” Also, what people consider funny can vary from place to place. Consider, for example, how different American humour is from British humour, even without a language barrier to cross.

Which is then reiterated within the Arab context in the other article, the humorously titled When Translating Jokes, Is It Important to Make the Reader Laugh?

Several of the jokes Salem and Taira used in their presentation highlighted the particular difficulties of Arabic-English translation. One of the “who’s behind Omar Suleiman” jokes, for instance, functioned entirely through a shift from fos7a to 3ameya. Certainly, English has many different registers (one could translate into Shakespearean English, into Black English, into corporate-jargon English) but none of them function quite the same way as TV-broadcasting fos7a and casual-use 3ameya.*

There were a bunch of other links that I’ve excluded due to time and quality, but the Guardian’s take on international performers at the Edinburgh Festival is a good place to finish up, noting that so much of comedy (the long play version of the joke, I guess) goes well beyond the language used to deliver it:

In Italy, says Palmieri, the culture is visual, the comedy more physical – think Roberto Benigni – and deadpan humour is known as umorismo inglese. To Palmieri, the English language is uniquely suitable for verbal humour. “It’s very idiomatic, it contains a lot of polysemantic or homophonic words, which you can play with a lot. The same things that make English difficult to learn are what make it good for comedy.”

The British comedian Stewart Lee once blamed the German reputation for humourlessness on that language’s inflexible sentence structures, which preclude the twist-in-the-tail techniques on which English-language comedy depends. Fortunately, German comic Henning Wehn has never had to translate an existing act into English – like Palmieri, he took up comedy after moving to the UK. The only difficulty he has now is with going off-script. “If I want to improvise, or go off on a tangent, I quickly come to my limit. I’ll make grammatical mistakes, or can’t think of the right words.”

But not being a native English speaker can prove an advantage.

Teeuwen says non-native speakers do comedy “the same way Sinatra sings. He’s very conscious of every word he says, and of the way he places and phrases them. He grooves, but a bit more consciously than most.”

 

*I tried different encodings for the page, but I couldn’t get other recognisable terms for fos7a or 3ameya, neither of which looks correct nor comes up as an Arabic language when I search. If someone could clarify the meanings or correct terms, I’d appreciate it.**

**This has been cleared up in the comments by gr33ndata, who is the author of the post titled “Who’s behind Omar Suleiman” linked to above.

The things that make me go whoop: Tatoeba

It’s not often you metaphorically whistle under your breath at something cool, but that’s just what I did when I learnt about Tatoeba, a site dedicated to an open, collaborative translation of sentences rather than words.

“Why sentences? You may ask. Well, because sentences are more interesting. Sentences bring context to the words.”

Even more fascinating is the contextualisation of what they are doing in comparison to what everyone else is doing – in particular the failing of platforms or software that do language pairing, in that some languages get left behind and some language pairs are never consummated (Icelandic/Swahili anyone?). The openness and transparency about their motivations and methods is refreshing compared to Facebook and Google – the description of how they are overcoming the language pairing issue harks back to the mathematical principle of transitive relations (i.e. whenever A = B and B = C, then it’s also true that A = C) – all languages are interconnected – “awesome, right?”. Absolutely.
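To make the transitive idea concrete, here is a small sketch of how indirect translations could be found by following chains of links between sentences. This is my own illustration, not Tatoeba’s actual code or data model; the (language, id) tuples and the function name are hypothetical placeholders.

```python
from collections import deque


def reachable_translations(links, start):
    """Return every sentence connected to `start` by a chain of links.

    `links` is an iterable of (sentence_a, sentence_b) pairs, each meaning
    "these two sentences are translations of each other". Links are
    treated as symmetric, and a breadth-first search follows them.
    """
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)

    seen.discard(start)  # a sentence is not its own translation
    return seen


# Hypothetical sentence ids: Icelandic and Swahili are never linked
# directly, only through a shared English sentence.
links = [
    (("is", 1), ("en", 2)),
    (("en", 2), ("sw", 3)),
]
print(reachable_translations(links, ("is", 1)))
```

In practice, translation is only approximately transitive, so a real system would probably want to limit or flag how many hops it has taken.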

The video below is a great introduction, and with Universal Subtitles, it’s available in 8 languages (Italian, the ninth, is only at 27% translated). This makes perfect sense – the very idea underlying Tatoeba completes the loop that these two projects make, and proves the old adage of the whole being greater than the sum of its parts.

They provide a range of tools for the glyph-based Chinese and Japanese languages; a visualisation of “what’s going on now”; and the sentences can be downloaded, although there is a warning:

The data you will find here will NOT be useful unless you are coding a language tool or doing some work on data processing.

If you want data that you can use as a humble language learner, you can check out the lists section where you can build your own lists of sentences or view others’ lists and print them.

You can make your own lists, each of which can be downloaded! This is simply amazing – my mind is boggling at the possibilities (as a computer scientist). If anything tells me that this will be a must-watch platform/site, it’s that the famous meme In Soviet Russia… is already getting translated.

For machine translation freaks, Tatoeba will be producing one of the best corpora available – and it will be CC licensed. In terms of what else is available – this is very similar to the offerings of Chrome/Google+ and Facebook – but those platforms aren’t open, and aren’t giving back to the community that created those translations openly, for that community to build upon. The functions are only there to be monetised, for that community to become a commodity. Which makes Tatoeba all the more interesting:

It’s part of an ecosystem that we want to build. We want to bring language tools to the next level. We want to see innovation in the language learning landscape. And this “cannot” happen without open language resources which can not be built without a community, which can not contribute without efficient platforms. So ultimately, with Tatoeba, we are only building the foundations…to make the Web a better place for language learning.

The speed of the spoken word

An interesting piece in Time last week, titled Slow Down! Why Some Languages Sound So Fast, looks at a study from the journal Language:

It’s an almost universal truth that any language you don’t understand sounds like it’s being spoken at 200 m.p.h. — a storm of alien syllables almost impossible to tease apart. That, we tell ourselves, is simply because the words make no sense to us. Surely our spoken English sounds just as fast to a native speaker of Urdu. And yet it’s equally true that some languages seem to zip by faster than others. Spanish blows the doors off French; Japanese leaves German in the dust — or at least that’s how they sound.

But how could that be? The dialogue in movies translated from English to Spanish doesn’t whiz by in half the original time after all, which is what it should if the same lines were being spoken at double time. Similarly, Spanish films don’t take four hours to unspool when they’re translated into French.

Vietnamese was used as a reference language for the other seven, with its syllables (which are considered by linguists to be very information-dense) given an arbitrary value of 1.

For all of the other languages, the researchers discovered, the more data-dense the average syllable was, the fewer of those syllables had to be spoken per second — and thus the slower the speech. English, with a high information density of .91, was spoken at an average rate of 6.19 syllables per second. Mandarin, which topped the density list at .94, was the spoken slowpoke at 5.18 syllables per second. Spanish, with a low-density .63, ripped along at a syllable-per-second velocity of 7.82. The true speed demon of the group, however, was Japanese, which edged past Spanish at 7.84, thanks to its low density of .49. Despite those differences, at the end of, say, a minute of speech, all of the languages would have conveyed more or less identical amounts of information.

“A tradeoff is operating between a syllable-based average information density and the rate of transmission of syllables,” the researchers wrote. “A dense language will make use of fewer speech chunks than a sparser language for a given amount of semantic information.” In other words, your ears aren’t deceiving you: Spaniards really do sprint and Chinese really do stroll, but they will tell you the same story in the same span of time.
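The tradeoff can be eyeballed by multiplying each language’s density by its syllable rate. The sketch below does that for the four languages whose figures are quoted above. Treating the product as a rough “information per second” proxy is my own simplification, not necessarily the measure the researchers used, and the density values are relative to Vietnamese at 1.

```python
# (information density relative to Vietnamese = 1, syllables per second),
# taken from the Time extract quoted above
FIGURES = {
    "English": (0.91, 6.19),
    "Mandarin": (0.94, 5.18),
    "Spanish": (0.63, 7.82),
    "Japanese": (0.49, 7.84),
}


def information_rate(density: float, syllables_per_second: float) -> float:
    """Rough proxy: information per syllable times syllables per second."""
    return density * syllables_per_second


rates = {
    language: round(information_rate(density, rate), 2)
    for language, (density, rate) in FIGURES.items()
}

for language, value in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{language}: {value}")
```

On these four quoted figures the products run from roughly 3.8 (Japanese) to 5.6 (English), so “more or less identical” is a fairly loose fit, at least by this simple measure.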

Obviously one of the missing pieces in this puzzle is the act of translation and interpreting – the reason that the film translated from Spanish to French doesn’t take four hours is as much a function of a language’s spoken speed as it is the effort of translating. I think that this report needs a serious review from the translation/interpreting academics.

(via The Economist)