dotSub – the last speaker of an American language

I recently got an e-newsletter from dotSub, which I hadn’t heard from in a while. Something like a cross between Universal Subtitles and Vimeo, it’s another of those fascinating attempts at making the web work across languages.

The first video I saw in the newsletter was, funnily enough given my last post, from Al Jazeera English:

The Wichita language has dwindled to the point that only one fluent speaker remains; 83-year-old Doris McLemore is interviewed. The report then visits a Cherokee language immersion school.

Al Jazeera launches a translation team

Emails and Twitter told me this week that Al Jazeera, the Arabic-language news network, has announced a Universal Subtitles team. The network will choose videos it thinks are interesting, relevant or timely, and is asking for volunteers to subtitle them.

In particular, Arabic->English translators/subtitlers are in demand – once a video is in English, it quickly gets translated into other languages.

Sign up here!

Translation technology standards find a new home

Last month I noted that LISA, the Localization Industry Standards Association, was insolvent. As Jost notes (well, noted – it was almost a month ago now), it didn’t take long before someone stepped up to take over the role of the standards body. Initially it was TAUS (ie, the “translation buyers” industry body), quickly followed by GALA (the Globalization and Localization Association, ie the “translation providers” industry body) – both organisations that represent big businesses (TAUS does make overtures of freedom, but I’ve expressed concerns about those previously). I’m going to quote Jost extensively, though edited, because I think he makes some very good points:

… the future of its most important products, the translation data exchange standards TMX, SRX, and TBX, was somewhat vague. Well, it didn’t take very long for organizations to pop up and register as, shall we say, “interested parties” in these standards.

However, most of you will have noticed that a very important segment of our industry – the individual translator – is presently not represented by these groups or any other group that is offering to look after the standards. Is that important? I think it’s extremely important.

Here’s the crucial question: What are translation data exchange standards good for? The answer will always depend on whom you ask.

Ask the translation buyer and he will tell you that he is losing significant amounts of money each year because of gaps in interoperability between different systems. Larger LSPs would answer in a similar manner but would also stress the necessary freedom to choose the technology they already own. For the smaller language providers, that last point becomes more relevant, and for the individual translator it becomes crucial. Clearly, all of these are valid points and need support, but each group will naturally look after their own interests first and foremost.

Exchange standards will always provide some kind of exchange between technologies, but it is important for us as translators to make sure that it’s (also) happening at the level at which we need it to happen. For instance, if data is being exchanged exclusively on the side of the translation buyer or the language service provider but the exchange on our desktops is not possible or more difficult, that clearly would be against our interests. Is that going to happen? I don’t think that anyone is planning for anything like that with the existing standards or any new standards in the making, but the fact is that there are some standards that work very well for translation buyers but are certainly against our interests. In my opinion, one of those standards is DITA, an XML-based standard that provides the ability to segment the source text into small chunks that can be used in a variety of ways and allow for a great reuse of data; however, this works much to the detriment of the translator who often lacks the necessary context. I realize that we aren’t talking about DITA here, but it’s one of those prime examples where we failed to participate in the development. Another that is probably closer to most of our hearts and where I wish there had been more translator participation in its first round of development is the termbase exchange standard TBX, a standard that I believe also has the corporate terminologist more in mind than the translator.

So what should happen? Alan Melby, I, and others have been talking about ways to harness the laudable engagement of TAUS and GALA while also representing the interests of translators. The most logical organization to represent translators on an international scale would be the aforementioned FIT, the Federation of International Translators, which has a technical committee… One possibility for all of these organizations – plus possibly some others who are stakeholders in this process – would be to form a task force under the auspices of the Translation and Interpretation Summit, an umbrella organization for many different language associations within the industry, including GALA, TAUS, and FIT.

I would be intrigued to hear what some of you have to say. Is this all not important to you? Do you have other and possibly better ideas on how to represent translators’ interests in this context? Are you interested in getting engaged yourself? Let me know.
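(An aside for anyone who hasn’t handled these formats: TMX is simply XML that pairs source and target segments so a translation memory can move between tools. Below is my own minimal sketch of reading one in Python – the file name and language codes are made up, and real TMX files carry far more metadata than this assumes.)

```python
# Minimal sketch: pull source/target segment pairs out of a TMX file.
# My own illustration only -- "memory.tmx" is a made-up file name, and
# real-world TMX documents carry far more header metadata than this.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx_pairs(path, source_lang="en", target_lang="fr"):
    """Yield (source, target) segment pairs from the TMX body."""
    root = ET.parse(path).getroot()
    for tu in root.iter("tu"):  # one <tu> per translation unit
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or "").split("-")[0].lower()
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang] = "".join(seg.itertext())
        if source_lang in segs and target_lang in segs:
            yield segs[source_lang], segs[target_lang]

if __name__ == "__main__":
    for src, tgt in read_tmx_pairs("memory.tmx"):
        print(f"{src!r} -> {tgt!r}")
```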

I agree whole-heartedly with Jost on this – there definitely needs to be more involvement from translators (or a translators’ union) in the development of these standards. In fact, because I’m more of a rabble-rouser than Jost, I would suggest that TAUS and GALA should exclude themselves from the process completely – I don’t believe there is any room for corporate involvement in standards development; of all potential participants, they can afford to adapt.

Further, as a strong advocate of free software, I think the leading translation FOSS developers (the OmegaT devs, for example) and the project managers of the leading FOSS platform L10n projects (eg Django, Java or Plone) should also be involved.

Having had that rant, I understand Jost’s position – it’s conciliatory and constructive, rather than my more tantrum-oriented “everything must be free in my anarcho-world” approach. But it’s only by my being the ratbag that Jost gets to look like the centrist, and if he hasn’t already guessed: I’ve got your back, Jost. I’d be happy for you to represent me.

If you haven’t already subscribed to Jost’s fortnightly email, do it now.

Google Translate now has speech input

I run Ubuntu and the bleeding-edge browser from Google, Chromium – essentially a daily build of what ships as Chrome on other OSs. I can do this because I also have Firefox (and at least four others) installed for when something doesn’t work as it should.

I mention this because Google have announced that you can now use speech input with Google Translate in the Chrome browser.

If you’re translating from English, just click on the microphone on the bottom right of the input box, speak your text, and choose the language you want to translate to. In fact, you can even click on the “Listen” feature to hear the translated words spoken back to you!

I would experiment and say more, but it would seem that Chromium doesn’t have access to this function either!

OCR on the Android gets a failing grade

Google has released their Docs app for the Android OS, which includes an OCR function – but its quality is pretty appalling. Compared with ABBYY FineReader, its output is embarrassing. I can’t say I’m that surprised. OCR is hard, and there are commercial options available at fairly reasonable prices considering the complexity of the functionality. The software used by Google, Tesseract, lay dormant from 1995 to 2006. I think it can, and probably will, be improved, and Google are really the only people who could do it – they have the resources and can shoehorn it into other projects like Android or Google Docs. There is little in it for the independent developer, apart from the odd grad student.
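If you want to judge Tesseract for yourself, it’s easy enough to run locally. Here’s a rough sketch using the third-party pytesseract wrapper (the image path is just a placeholder):

```python
# Rough sketch: run Tesseract over a scanned page and dump the text.
# Assumes the tesseract binary plus the pytesseract and Pillow packages
# are installed; "scan.png" is a placeholder for whatever image you use.
from PIL import Image
import pytesseract

image = Image.open("scan.png")
text = pytesseract.image_to_string(image)  # plain-text OCR output
print(text)
```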

To be honest I’m quite surprised at the state of Google’s OCR – given that they are scanning a large proportion of the world’s books, you would like to think they had already nailed this one.

DEviaNT – an innuendo-identifying AI?

Ok, so it’s been a while, but in my defence, I’ve moved house and taken on full-time parenting – kids are busy-making, let me tell you.

There’s been so much going on that I’m not even sure where to start…

Two AI researchers have created a system called DEviaNT – short for Double Entendre via Noun Transfer – that attempts to identify innuendo. Their recent paper, “That’s What She Said: Double Entendre Identification”, describes an attempt at natural language processing – already a hard task, and one made harder still by puns and humour, which depend on contextual hints:

Three functions were used to score sentences for noun euphemisms (ie, does a test sentence include a noun likely to be used in an erotic sentence?). Sentences were also scored on the presence of adjective and verb combinations more likely to be found in erotic literature. Finally, they used some surface information, such as the number of punctuation and non-punctuation tokens in each sentence.
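To give a rough flavour of that kind of scoring – and this is purely my own toy sketch, not the authors’ code; the word lists and weights below are invented – it might look something like this:

```python
# Toy illustration of DEviaNT-style features -- NOT the paper's method.
# The word lists and the weights are invented purely for demonstration.
import string

EUPHEMISM_NOUNS = {"banana", "sausage", "package", "wood"}   # made-up list
EROTIC_ADJ_VERB = {("hard", "push"), ("wet", "slide")}       # made-up list

def innuendo_score(sentence):
    words = [w.strip(string.punctuation).lower() for w in sentence.split()]
    words = [w for w in words if w]

    # 1. Noun euphemisms: nouns that also turn up in erotic contexts.
    noun_hits = sum(w in EUPHEMISM_NOUNS for w in words)

    # 2. Adjective/verb combinations more common in erotic writing.
    pair_hits = sum((adj in words) and (verb in words)
                    for adj, verb in EROTIC_ADJ_VERB)

    # 3. Surface features: punctuation characters in the sentence.
    punct = sum(ch in string.punctuation for ch in sentence)

    return 2.0 * noun_hits + 1.5 * pair_hits + 0.1 * punct

print(innuendo_score("You really have to push it hard to get it in."))
```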

Interesting 3D language maps

Artist R. Luke DuBois has produced a colour-coded series of maps of America, called A More Perfect Union, showing the distribution of words used in online dating profiles. There’s the expected analysis, as will always be the case when focusing on such a “sexy” topic:

Looking at the Naughty map, DuBois said that from this image he can tell that no one in Wyoming used “naughty” in their profile, but a larger number of women in Colorado did. In addition, all the purple on the Nice map suggests that both men and women use “nice” in their profiles.

More interestingly, and less analysed, at least by this article, is the dimensionality achieved in the maps – simultaneously showing usage and gender by state, in map form. By using gradients of red-purple-blue for gender, and bright-dark for levels of use, it’s easy to show complex information at a glance – light red? Used by a lot of women and few men. Dark purple? Barely used by either sex.
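For what it’s worth, here’s a rough sketch of that kind of encoding – my own approximation of the idea, not DuBois’s actual palette:

```python
# Rough approximation of the maps' colour encoding -- not DuBois's
# actual palette, just an illustration of the two-dimensional idea.
def word_colour(female_share, usage):
    """female_share: 0.0 = only men use the word, 1.0 = only women.
    usage: 0.0 = nobody uses it, 1.0 = very heavy use.
    Returns an (r, g, b) tuple with each channel in [0, 1]."""
    red = female_share * usage           # women pull the hue towards red
    blue = (1.0 - female_share) * usage  # men pull the hue towards blue
    return (red, 0.0, blue)              # low usage fades everything dark

# A few hypothetical data points:
print(word_colour(0.9, 0.8))  # mostly women, heavy use  -> bright red
print(word_colour(0.5, 0.9))  # both sexes, heavy use    -> bright purple
print(word_colour(0.5, 0.1))  # both sexes, little use   -> dark purple
```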

This is a particularly powerful and effective way to show such complex information.


MKB on language at the Conference on World Affairs

Maggie Koerth-Baker describes a recent out-of-hours discussion at the CWA, following a conference panel on how the language we use often moulds our perception of reality. In those extracurricular conversations she came across a fascinating insight into how the deaf perceive and are perceived:

Over the course of the day on Monday, I spoke with several people—panelists, as well as conference volunteers and organizers—about the links between language and worldview. In one of those conversations, Emily Gunther, a conference volunteer and sign language interpreter, told me about some of the ways that Deaf culture and American Sign Language intertwine.

One of the most interesting things Gunther told me about: A lot of hearing people often describe Deaf people as “rude”. Not because of how the deaf communicate, but because of what they say.

Unless they’re born into a Deaf family, Gunther told me, most deaf people grow up being at least somewhat excluded from the spoken conversations going on around them. Someone may translate for them, but details are often left out—especially when hearing people try to be socially polite.

Think of all the times we try to describe a person without talking about a characteristic that we’re worried it might be offensive to mention. A big schnoz becomes, “You know, that guy. You’ll know him when you see him.” If your friend shows up with too much makeup on, you might say, “Wow, you’re really dressed up today.”

It’s difficult to translate that unspoken context into ASL without just saying, “That guy who has a big nose.” Or, “You’re wearing too much makeup.” Because of that—and because a lifetime of exclusion from hearing conversations has made many deaf people wary of leaving out information—it’s completely normal within Deaf culture to just say things that come off as rude to the hearing.