Short Cuts


Quite a few things this week. With my schedule busying up, the holiday season behind us and people back in the swing of things, that’s not too surprising, I guess.

1. Slashdot is reporting that there is now a commercially available device specifically designed to help health professionals diagnose, analyse and communicate with patients between whom there is a language barrier. The Phrazer is not cheap at $12,000 to $18,000, but has a 7″ screen that patients can point at should that be necessary and “can hold over 300 languages at any one time and, after it identifies the patient’s native tongue, gathers the necessary background information using pre-recorded videos of doctors speaking in the patient’s own language.”

2. Kaggle is a relatively new website that runs very interesting competitions (for want of a better word) relating to big data – large sets of information that can require complex analysis over many variables. Previously, the Netflix Prize had been a one-off competition to improve the “if you liked X, you may like Y” algorithm of Netflix, an online movie distributor – Kaggle have taken this idea to the next level by creating a space where organisations, businesses and researchers without big data experts of their own can put their problems to people who have that expertise.

One of their newest competitions requires “participants to develop an algorithm to identify who wrote which documents” – which, in layman’s terms, equates to handwriting recognition:

“Writer identification is important for forensic analysis, helping experts to deliberate on the authenticity of documents. This competition aims to further the science of writer identification. It requires participants develop algorithms that can identify handwriting. This is a difficult problem because a writer never reproduces exactly the same characters.

Writer identification generally requires two steps. The first is an image-processing step, where features are extracted from the images. The second step is a classification step, where the document is assigned to the “closest” document in the dataset according to the “difference” between their features.

In this contest, a previously unpublished data has been made available, containing the writings of more than 50 writers. Participants are asked to provide a similarity score, showing how probable it is that two documents are written by the same person. For participants who are not familiar with image-processing, a set of geometrical features extracted have been provided.”
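The two-step process described in the quote can be sketched in a few lines of Python. This is a toy illustration only, not the competition’s actual pipeline: the feature values and document names below are made up, and the real geometrical features provided to participants would be far richer.

```python
import math

# Toy sketch of the two-step writer-identification pipeline described above.
# Step 1 (image-processing feature extraction) is faked with hand-picked
# numbers; step 2 scores the "difference" between feature vectors and assigns
# a query document to the "closest" document in the dataset.

def distance(a, b):
    """Difference between two documents' feature vectors (Euclidean)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    """Map distance into (0, 1]: 1.0 means identical features."""
    return 1.0 / (1.0 + distance(a, b))

# Hypothetical geometrical features (e.g. slant, stroke width, spacing)
documents = {
    "doc_a": [0.80, 0.31, 1.20],
    "doc_b": [0.78, 0.33, 1.18],  # close to doc_a: plausibly the same writer
    "doc_c": [0.20, 0.90, 0.40],  # far from doc_a: likely a different writer
}

# Classification step: find the document closest to doc_a in the dataset
query = documents["doc_a"]
closest = max(
    (name for name in documents if name != "doc_a"),
    key=lambda name: similarity(query, documents[name]),
)
print(closest)  # doc_b
```

The similarity score is the shape of answer the contest asks for – a number expressing how probable it is that two documents share a writer – though real entries would learn that mapping from the data rather than use a fixed formula.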

While I understand that I probably don’t have a lot of big-data-obsessed readers, I think Kaggle’s onto a great idea. For example, if my local Metro service can’t afford enough analysis to get its damn timetabling elegant, maybe they could hand the data and constraints over to Kaggle, for free or a small prize, and someone else could solve that problem for them? Or, for instance – how can we get better machine translations? I’ll post the results when the competition is over.

3. Quite a few podcasts have focused on language recently, and I thought I’d link through to some.

The 99% Invisible podcast, which is largely about design, recently focused on Esperanto and the *design* decisions behind it, which I thought was an interesting take – I’d not thought much about how it was designed, nor had I seen much analysis in this regard. Maybe not new to some, but a great five-minute intro in relation to other invented languages. Esperanto, unlike other created languages to that point, occupies a magic middle ground between rigid specificity and arbitrary sloppiness that makes it feel much more natural than those that came before. There’s a lovely little anecdote about war games the US Army would play during the Cold War – wanting the “other” side to speak a foreign language, but not wanting to insult any racial or language group, they used Esperanto to keep the peace.

4. Radiolab‘s episode on Words from August last year. Radiolab’s podcast follows a similar idea to This American Life – an hour made up of smaller vignettes. In one story we learn of a woman who learns sign language after an accident leaves her speechless, and she goes on to become an early sign interpreter. She meets a 27-year-old deaf man who doesn’t understand the concept of language – he didn’t know about sound, didn’t know he was deaf, or that deafness set him apart from the rest of humanity. She taught him words – it’s quite a beautiful story.

The podcast goes on to talk about language development in children and the creation of language – going back to Shakespeare’s prolific coinages that still exist today, words and phrases that were previously unheard – then onto a neurologist who recognises her own stroke and discusses the role language plays in the brain, and what she experienced (from a researcher’s perspective) while having it. There’s a TED video about her that I urge you to watch – she’s engaging and it’s quite an unusual and interesting talk.

Finally, the podcast focuses on sign language in Nicaragua in the 70s – a country that had no sign language. Children and their families and communities would create their own signs to get by – until the dictator’s wife founded a school for the deaf, and the children there created their own sign language organically. Its development over time is particularly fascinating by virtue of the *second* generation of users – the first generation lacked words for concepts like “thinking” and “believe”, “forget” and “know”; they were so busy building the fundamentals. It’s the second generation that creates this part of the language – and it leads to some very surprising results.

5. Finally, Jost Zetzsche (whose Translator’s Toolkit was required reading until the free version became so short as to be empty) started a thread on the OmegaT list looking for a Free Software vs commercial software stoush – and, not really finding one, it led to some interesting links: Pirate Pad, for collaborative writing over the internet with each user assigned a colour to make it easier to see who wrote what; and Gobby, which does something similar but also includes a chat function, and seems to be locally installed software rather than an online service.