Duolingo – translating the web

There’s only a video available at the moment, unless you want to sign up to one of their language combinations, but Duolingo’s concept is very interesting. Basically, their setup turns the translation of web segments into a language-learning facility. It would seem that there are levels from beginner to expert to ease you along (Children: you should always couple reading and writing of foreign languages with spoken and aural practice!).

Interestingly, there is a TED video featuring Luis von Ahn, one of the creators of CAPTCHA, about his work with Duolingo and his motivations. CAPTCHA is currently being used to help with OCR – OCR fails on old books with faded writing, for instance, but CAPTCHA offers the opportunity for a distributed solution to this problem by getting humans to decode the words. With CAPTCHAs on 350 000 sites, some as large as Twitter and FB, they are now digitising about 100 million words a day – about 2.5 million books per year – via CAPTCHA (well, reCAPTCHA to be 100% correct).

This is a fascinating project, and I highly recommend the video if you are a translator and want to see a threat bigger than machine translation.

Freedom Fone documented

Last week FlossManuals teamed up with the Freedom Fone community to write some documentation (pdf). I really like this project, particularly the way it uses technology to overcome linguistic barriers:

Freedom Fone makes it easy to build interactive, two way, phone based information services using interactive audio voice menus, voice messages, SMS and polls. The DIY platform is accessible, user-friendly, low-cost, global and does not require Internet access for users and callers alike. It takes advantage of audio to address language and literacy barriers when reaching out to the millions of people living on the margins of the information society.

The book is no pushover, coming in at around 160 pages, including 30 pages of examples and scenarios.

Freedom Fone enables you to design your own interactive menus to:

  • Share audio information with your audience; this audio information can take many forms including voice menu (press 1, press 2, etc.), educational dramas, short news items, or even a song!
  • Organise a poll to enable your audience to vote on an issue using their phone;
  • Collect SMSs from your audience – these might be updates about specific news events, alerts or similar time critical information;
  • Get your audience to leave audio messages to share their opinion on a particular topic or make reports in their own language.

Saving Garifuna

Here’s a short video about the Garifuna language called “I want to go back” – saving an endangered language. Taking popular music that already existed, infusing it with Garifuna beats, and translating the lyrics, the Afri-Garifuna Jazz Ensemble are attempting to raise

…awareness of the endangered language of the Garifuna People that was proclaimed a “Masterpiece and Oral Intangible Heritage of Humanity.” Therefore, Afri-Garifuna Jazz will be another platform where the history and language of the Garifuna will be safeguarded through music.

The language is particularly interesting in that it’s made up of other languages:

  • 45 % Arawak (Igñeri)
  • 25 % Carib (Kallínagu)
  • 15 % French
  • 10 % English
  • 5 % Spanish or English technical terms

It is also interesting because there is one vocabulary used by men and another used by women:

Relatively few examples of diglossia remain in common speech, where men and women use different words for the same concept, such as au ~ nugía for the pronoun “I”. Most such words are rare, and often dropped by men. For example, there are distinct Carib and Arawak words for ‘man’ and ‘women’, four words altogether, but in practice the generic term mútu is used by both men and women and for both men and women, with grammatical gender agreement on a verb, adjective, or demonstrative distinguishing whether mútu refers to a man or to a woman (mútu lé “the man”, mútu tó “the woman”).

There remains, however, a diglossic distinction in the grammatical gender of many inanimate nouns, with abstract words generally being considered grammatically feminine by men, and grammatically masculine by women. Thus the word wéyu may mean either concrete “sun” or abstract “day”; with the meaning of “day”, most men use feminine agreement, at least in conservative speech, while women use masculine agreement. The equivalent of the abstract impersonal pronoun in phrases like “it is necessary” is also masculine for women, but feminine in conservative male speech.

The part of my brain that understands the motivations and intellectual curiosity of translation and interpreting is springing all over the room right now. I probably should have done linguistics instead of mathematics at university.

On the fascism of Grammar

I don’t know who put me onto this two-part essay on grammar yesterday (I feel like it was Superlinguo, but I could be wrong), but I’ve enjoyed reading and chewing on it. It starts as a piece on why grammar purism is annoying, distracting and misplaced:

When my father is interacting with people who find out he is a doctor, he often hears, “I have a medical question for you.” My sister, an accountant gets, “I have a tax question for you.” I feel particularly bad for my brother-in-law, who is both an accountant and a lawyer and who probably not only has to field general tax and legal questions but the questions of people who are in legal trouble because of their taxes. But when people find out I’m an English teacher, they often say, “I have a grammar question for you…

A big part of the problem, in my estimation is that we as a society–even the most overeducated among us–have a poor grasp of what grammar actually is and what role it plays in writing. So here it is: grammar is a set of standards that we as a linguistic group have agreed upon to help us understand one another. Those rules tend to be culturally and regionally specific and change over time. No one descended from a mountain with two stone tablets reading, “Thou shalt not use a preposition at the end of a sentence.” Adhering to grammar guidelines is about making sure that you are understood. It’s also about self-presentation, but it’s not about adhering to some sort of moral code.

Grammar too often gets confused with what it is designed to produce, which is fluency. Fluency here is defined not just by your ability to speak or write in a particular language but by a certain facility with that language, the ability to make words do exactly what you want them to do, to make them sparkle and titillate and inspire, to not just say the right thing but to sound good doing it. And that may or may not include utilizing proper grammar. Often fluency means learning precisely when to follow the rules and when to break them, to tune the correctness of your usage to the expectations of your audience (idiom!). Or to use non-standard constructions for effect (Iseewhatyoudidthere). Fluency is the ability to say exactly what you mean exactly how you want, which is harder than it sounds.

I’ve written previously on language mutability in the case of Indonesian punk rock band Punkasila and why I think it’s important. In Punkasila’s case we see language and art sitting side by side – and while we see language moving, when the art doesn’t move, it loses all power to effect change. This piece attributed to Mark Twain and Valerie Yule’s long career as an educator have been my two go-to references; this will be my third.

As I write this, the music of artist Dual Core has come on, and I realise that hip hop threw grammar out the window over twenty years ago and hasn’t seen a reduction in popularity as a result. Criticisms of the genre have never been “that was poorly articulated”; quite the opposite, in fact – when an MC can “make the words flow”, or express meaning in a clever and unique way, they are lauded.

While the headline I’ve chosen is overblown, my essential concern is one of conservative thought versus progressive thought. If we don’t sculpt our language in such a way that we can express new ideas, or old ideas and beauty in new ways, we run the risk of stagnation. A rusting on of ideas, an increasing boredom with beauty and difference. And that’s not the world I want to live in.

Part two of this essay is less rant, more literature – but has its own beauty. In particular, it addresses the idea of language formation moving between languages, in relation to Rushdie’s The Satanic Verses, and the richness that it provides:

However, you also have to account for the fact that Rushdie often uses the speech patterns of Central Asian English speakers in his prose, and that is part of what de-familiarizes it, though in an intriguing way, I think. There is an aural quality to his writing that makes for great out-loud reading. As an Indian man who grew up in the wake of the British Raj and inhabits a globalizing society, he is interested in how linguistic groups from the former colonies have adapted the language of their colonizers. But he isn’t exactly doing dialect, which has historically been used as a kind of literary black-face. He isn’t trying to convey a character’s accent through non-standard spelling. Instead, he reproduces the idiom and cadence of those speech patterns, which is really effing cool.

It is for this reason that I don’t believe that translators and interpreters need to worry about their working futures – computing has a long way to go before it can weave this magic.

PunyCode – internationalising the web

I signed up to the JavaScript Weekly email blast a few weeks ago, and have enjoyed flicking through its range of stories. I was interested this week to see punycode.js was released – a small JavaScript library that does PunyCode conversions. What is PunyCode?

When the Internet, and computers for that matter, were first being developed, no one thought so far ahead as to add language neutrality – everything was in English. That’s not entirely true – it was all in ASCII, but that is a story for another day. I presume the thinking either just wasn’t there, or it was considered a problem to be addressed when it arose. I find it hard to be overly critical of this approach – while it is culturally insensitive and US-centric, no one had any idea what massive changes were about to be unleashed on to the whole planet as a result of their research.

The situation has come a long way since then – from a thriving Internationalisation and Localisation industry to the subtitling underground, the world is now online.

The one thing missing, of course, was foreign (i.e., non-English-alphabet) characters in URLs or domain names, but

(i)n late 2009, the Internet Corporation for Assigned Names and Numbers (ICANN) approved the creation of internationalized country code top-level domains (IDN ccTLDs) in the Internet that use the IDNA standard for native language scripts.

Essentially, non-ASCII scripts like Chinese characters and Arabic script were approved in URLs. A great day for an international internet – with Egypt, the Russian Federation, Saudi Arabia, and the United Arab Emirates being the first countries to have the opportunity. Of course, there was one remaining problem: the whole software stack between bare metal and the browser was written expecting ASCII characters. Rewriting it all would take an unacceptable amount of time, and the perceived increase in complexity would potentially bloat some quite svelte software. Computer scientists ended up doing what they do best – routing around the problem.

This is where PunyCode comes in – it’s designed to map the Unicode character set into ASCII for lower-level software to understand. It uses complex mathematics to do this, which you are welcome to attempt to understand in the three main RFCs that address this issue: RFC 3492 (Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications), RFC 5891 (Internationalized Domain Names in Applications (IDNA): Protocol), and the less-useful-to-non-nerds-than-it-sounds RFC 5894 (Internationalized Domain Names for Applications (IDNA): Background, Explanation, and Rationale).

Luckily, that’s what I’m here for – although it should be noted that RFCs adhere to templating standards that put almost every other academic journal on the planet to shame. Computer scientists and engineers have this thing about exactness, I guess.
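To make the mapping concrete before getting into the RFCs, here is a minimal sketch, assuming Python 3 and its standard-library `punycode` and `idna` codecs. The label `bücher` and the domain `bücher.example` are my own illustrative choices, not taken from the RFCs, and Python’s built-in `idna` codec follows the older IDNA 2003 rules rather than RFC 5891, so treat this as a demonstration of the idea rather than a reference implementation.

```python
# Illustrative label; "bücher" is an assumption chosen for this example.
label = "bücher"

# Raw Punycode (RFC 3492): the ASCII characters are copied through first,
# then a hyphen, then an encoding of where the non-ASCII characters belong.
raw = label.encode("punycode")            # b'bcher-kva'

# A full domain name goes through IDNA, which adds the "xn--" ACE prefix
# to each label that needed encoding and leaves pure-ASCII labels alone.
ace = "bücher.example".encode("idna")     # b'xn--bcher-kva.example'

# Both directions are lossless for this input.
assert raw.decode("punycode") == label
assert ace.decode("idna") == "bücher.example"

print(raw, ace)
```

The lower-level software only ever sees the ASCII form; the Unicode form is recovered at the edges, where it is shown to a human.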

I’ll start with a little of the background, with examples following.

Users have expectations about character matching or equivalence that are based on their own languages and the orthography of those languages. These expectations may not always be met in a global system, especially if multiple languages are written using the same script but using different conventions. Some examples:
o A Norwegian user might expect a label with the ae-ligature to be treated as the same label as one using the Swedish spelling with a-diaeresis even though applying that mapping to English would be astonishing to users.

o A German user might expect a label with an o-umlaut and a label that had “oe” substituted, but was otherwise the same, to be treated as equivalent even though that substitution would be a clear error in Swedish.

o A Chinese user might expect automatic matching of Simplified and Traditional Chinese characters, but applying that matching for Korean or Japanese text would create considerable confusion.

o An English user might expect “theater” and “theatre” to match.

Some examples pulled from RFC 3492:

7.1 Sample strings

In the Punycode encodings below, the ACE prefix is not shown. Backslashes show where line breaks have been inserted in strings too long for one line.

The first several examples are all translations of the sentence “Why can’t they just speak in <language>?” (courtesy of Michael Kaplan’s “provincial” page [PROVINCIAL]). Word breaks and punctuation have been removed, as is often done in domain names.
(A) Arabic (Egyptian):
u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644 u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F
Punycode: egbpdaj6bu4bxfgehfvwxn

(B) Chinese (simplified):
u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587
Punycode: ihqwcrb4cv8a8dqg056pqjye

(C) Chinese (traditional):
u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587
Punycode: ihqwctvzc91f659drss3x8bo0yb

(D) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074 u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D u+0065 u+0073 u+006B u+0079
Punycode: Proprostnemluvesky-uyb24dma41a

(E) Hebrew:
u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8 u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2 u+05D1 u+05E8 u+05D9 u+05EA
Punycode: 4dbcagdahymbxekheh6e0a7fei0b

(F) Hindi (Devanagari):
u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939 u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947 u+0939 u+0948 u+0902
Punycode: i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd

(G) Japanese (kanji and hiragana):
u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092 u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B
Punycode: n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa

(H) Korean (Hangul syllables):
u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774 u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74 u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C
Punycode: 989aomsvi5e83db1d2a355cv1e0vak1dwrv93d5xbh15a0dt30a5j\

(I) Russian (Cyrillic):
U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440 u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A u+0438
Punycode: b1abfaaepdrnnbgefbaDotcwatmq2g4l

(J) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070 u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070 u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061 u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070 u+0061 u+00F1 u+006F u+006C
Punycode: PorqunopuedensimplementehablarenEspaol-fmd56a

(K) Vietnamese:
U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068 u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067 U+0056 u+0069 u+1EC7 u+0074
Punycode: TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g

The next several examples are all names of Japanese music artists, song titles, and TV programs, just because the author happens to have them handy (but Japanese is useful for providing examples of single-row text, two-row text, ideographic text, and various mixtures thereof).

(L) 3<nen>B<gumi><kinpachi><sensei>
u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F
Punycode: 3B-ww4c5e180e575a65lsy2b

(M) <amuro><namie>-with-SUPER-MONKEYS
u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074 u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D U+004F U+004E U+004B U+0045 U+0059 U+0053
Punycode: -with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n

(N) Hello-Another-Way-<sorezore><no><basho>
U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D u+305D u+308C u+305E u+308C u+306E u+5834 u+6240
Punycode: Hello-Another-Way--fc4qua05auwb3674vfr0b

(O) <hitotsu><yane><no><shita>2
u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032
Punycode: 2-u9tlzr9756bt3uc0v

(P) Maji<de>Koi<suru>5<byou><mae>
U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059 u+308B u+0035 u+79D2 u+524D
Punycode: MajiKoi5-783gue6qz075azm5e

(Q) <pafii>de<runba>
u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
Punycode: de-jg4avhby1noc0d

(R) <sono><supiido><de>
u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
Punycode: d9juau41awczczp

The last example is an ASCII string that breaks the existing rules for host name labels. (It is not a realistic example for IDNA, because IDNA never encodes pure ASCII labels.)

(S) -> $1.00 <-
u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020 u+003C u+002D
Punycode: -> $1.00 <--
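If you would rather check a few of these than take the RFC’s word for it, here is a minimal sketch, assuming Python 3 and its standard-library `punycode` codec. The code points and expected encodings are copied from samples (B), (L), (Q) and (R) above; the names `SAMPLES`, `from_code_points` and `check_samples` are my own and purely illustrative.

```python
# Code points and expected encodings copied from the RFC 3492 samples above.
SAMPLES = {
    "B": ("4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 4E2D 6587",
          "ihqwcrb4cv8a8dqg056pqjye"),
    "L": ("0033 5E74 0042 7D44 91D1 516B 5148 751F",
          "3B-ww4c5e180e575a65lsy2b"),
    "Q": ("30D1 30D5 30A3 30FC 0064 0065 30EB 30F3 30D0",
          "de-jg4avhby1noc0d"),
    "R": ("305D 306E 30B9 30D4 30FC 30C9 3067",
          "d9juau41awczczp"),
}


def from_code_points(hex_points):
    """Build a string from space-separated hexadecimal code points."""
    return "".join(chr(int(h, 16)) for h in hex_points.split())


def check_samples():
    """Encode and decode each sample; map its label to True if both match."""
    results = {}
    for label, (points, expected) in SAMPLES.items():
        text = from_code_points(points)
        encoded = text.encode("punycode").decode("ascii")
        decoded = expected.encode("ascii").decode("punycode")
        # Compare case-insensitively: the RFC uses letter case in some
        # samples as an optional annotation, which codecs need not keep.
        results[label] = (encoded.lower() == expected.lower()
                          and decoded == text)
    return results


if __name__ == "__main__":
    for label, ok in check_samples().items():
        print(label, "ok" if ok else "MISMATCH")
```

Note that there is no `xn--` in any of these strings: as the RFC says, the ACE prefix is not shown in its samples, and the raw `punycode` codec does not add it either.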

I hope this has been as enlightening for you as it has for me – I was unaware of PunyCode before today as well.

Anaphraseus updated

Anaphraseus, a Wordfast (Classic) equivalent for LibreOffice or OpenOffice, has had a big update just recently, mostly to deal with the deprecation of the Google Translate API, but with some other goodies as well:

What’s new:

* Fixed hanging in table during cleanup/restore
* Placeables (only numbers up to now)
* Recognized soft hyphen in terms
* Bullets/numbering fix
* Mark terms cursor fix
* Fixed quotes for Google Translate
* Fixed bug in concordance search for big TMs (issue #2101540)
* Google Translate module rewritten to meet shutdown of API v1

It can be downloaded here.

For those who are still using OpenOffice, I highly recommend the move to LibreOffice. The company that owned the copyright to OOo was sold, and the codebase was forked when the new owner didn’t move fast enough to address some long-outstanding issues. The LibreOffice codebase now has a large number of fixes that had long been waiting in OOo, development is significantly faster, and most Linux distros are moving to LO as the default over OOo.