From its text-based beginnings as Bulletin Board Systems/Services (BBSs) and USENET, the internet has been used as a place to distribute the weird and wonderful.

Before Digg and Reddit existed, similar offerings were available from MetaFilter (MeFi) and SomethingAwful. I long ago signed up for Digg and Reddit, but for some reason I never really got the hang of MeFi – until recently.

I joined a week or so ago, and I’m pretty impressed so far. Here are a few examples of stuff that I found just yesterday:

MeFi’s Learn Korean Easy (Oh, the grammar!) reposts artist/adventurer Ryan Estrada’s great comic called Learn to read Korean in 15 minutes, which is fascinating. An internet hole opens up as I go searching for more information on Hangul, the origin of Hangul and its promulgator Sejong the Great. I know what I’ll be doing on my next interminable wait at an airport, which from the comments seems to be the place most people like to learn the phonetic alphabet system.

The other post of interest, lighter of my lifeboat, firearm of my loincloths, explains a neat artistic morph of text called the N+7 procedure, developed by French poet Jean Lescure. The rules are simple – change every noun to the seventh noun after it in a dictionary.

The N + 7 Machine is a page that implements the procedure for N <= 15 on text that you enter or paste in.
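For the curious, the procedure is simple enough to sketch in a few lines of Python. This is only a toy under stated assumptions, not how the N + 7 Machine itself works: the twelve-word noun list is a made-up stand-in for a real dictionary, the function name n_plus is hypothetical, and wrapping around at the end of the list is my own choice.

```python
import re

# Hypothetical toy "dictionary" of nouns, kept in alphabetical order.
# A real run would use the noun entries of an actual dictionary.
NOUNS = sorted([
    "apple", "boat", "cat", "dog", "egg", "fish",
    "goat", "hat", "ink", "jar", "kite", "lamp",
])


def n_plus(text, nouns, n=7):
    """Replace every known noun with the noun n places after it in the list."""
    position = {noun: i for i, noun in enumerate(nouns)}

    def swap(match):
        word = match.group(0)
        i = position.get(word.lower())
        if i is None:
            # Not a noun we know about: leave the word untouched.
            return word
        # Assumption: wrap around when we run off the end of the list.
        replacement = nouns[(i + n) % len(nouns)]
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"[A-Za-z]+", swap, text)


print(n_plus("The cat sat on the hat.", NOUNS))
```

With this toy list, "cat" becomes "jar" and "hat" wraps around to "cat". Passing a different n gives the N + 1, N + 2 and so on variants.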

Go forth and ART!




Mostly about language

I don’t like blogging like this, but it’s hard to find the time with an intermittent Internet. I find titbits, but I rarely follow links – I’ve not watched an online video in almost a year and my inbox has an email thread containing 276 emails with over 400 links to “revisit” once I return to the land of faster bandwidth. As though anyone on the Internet has time for 400+ old links.

However, as someone who is interested in language, it behooves me to relay this content that I’ve found.

I don’t know why I have a low opinion of Will Self, but I do. As a self-important anarchist I think that I rub up against other self-important *ists. Despite this I found his latest piece for the BBC, In defence of obscure words, a rollicking good skewering of the stupid, the vapid, the empty. Be it expressing a love of words and language and using them:

I’d point out that my texts were as full of resolutely Anglo-Saxon slang as they were the flowery and the Latinate. I’d observe that English, being a mishmash of several different languages, had a large and exciting vocabulary, and that it seemed a shame not to use it – especially given that it went on growing all the time, spawning argot and specialist terminology as freely as an oyster does its milt.

or the end result of a culture built by the risk averse:

But now that all formerly difficult subject matter is, if not exactly permitted, readily accessible, cultural artificers have no need to aim high. The displacement of aesthetically and intellectually difficult art as the zenith has resulted in all sorts of sad and interrelated phenomena.

In the literary world, books intended for child readers are repackaged and sold to kidult ones, while even notionally highbrow arbiters – such as Booker judges – are obsessed by that nauseous confection “a jolly good read”. That Shakespeare remains our national writer is, frankly, bizarre, given that with his recondite vocabulary, myriad historical references, and convoluted metaphorical language, were he to be seeking publication in the current milieu, his sonnets and plays would undoubtedly also be branded as ‘too difficult’.

As for visual arts, the current Damien Hirst retrospective at Tate Modern is a perfect opportunity to see what becomes of an artificer whose impulse towards difficult subject matter was unsupported by any capacity for hard cogitation or challenging artistry. The early works – the stuffed animals and fly-bedizened carcasses – retain a certain – albeit recherché – shock value, while the subsequent ones degenerate steadily to the condition of knocked-off merchandise, making the barrier between the gift shop and the exhibition space evaporate in a puff of consumerism.

But the most disturbing result of this retreat from the difficult is to be found in arts and humanities education, where the traditional set texts are now chopped up into boneless nuggets of McKnowledge, and students are encouraged to do their research – such as it is – on the web.

I quite enjoyed the brief moment of intellectual challenge that he poses.

Which is why I now turn to a phenomenon that really only exists because of the Internet, but grew from the old-style newsprint tropes “Word of the day”, maybe combined with “What in the world” – the longer-form list of obscure, obtuse, unused, hard to translate or extinct words. Usually in groups of five, eight or ten. I’m not immune to posting links to these lists here on Pineapple Donut, but it’s not often that it’s done anew – as an infographic and without the pronunciation of the words. And to stick it up to Mr Self, I found it through the most internet of ways – in RSS from a tumblr called this isn’t happiness, via mentalfloss, and then PopSci, to the original artist’s site, 21 Emotions with No English Word Equivalents.

At first I was put off by the filter of emotive words, but I came around as I thought about it – not only was Pei-Ying’s choice considered in that it provided a focus that’s easy to explain, empathise with and understand, but it gave her the opportunity to explore feelings that don’t have words in English, or any other language presumably, but are unique and identifiable to the (ahem, current) internet age. Unfortunately the artist’s site was so popular after the various postings that their broadband limit has been blown, or 509’d in tech speak.

I didn’t know that the Talkly awards even existed, but the Crikey language blog, FullySic, noted that last year it was given to Ingrid Piller. The award goes to an individual who has done the most to increase public knowledge about language, and she sounds like the person we would most like to be sitting next to on the 6am flight from Nadi to Tarawa.

Cory Doctorow fires up more passion in people than I’d expect – I find him interesting, intelligent and sometimes even enthralling, but the argy bargy that follows him is hard for me to comprehend. He writes for the Guardian on the difference between value and price in the internet era, largely focusing on positive externalities and their exploitation. Most interesting to me is his use of Google and its approach to translating.

A positive externality arises when you do something you want to do that also makes life better for someone else. For example, if you drive your car slowly and carefully to avoid a wreck, a positive externality is that other users of the road have a safer time of it, too. If you keep up your front garden because it pleases you, your neighbours get the positive externality of slightly buoyed-up property values from living on a nicely kept street.

Positive externalities — virtuous cycles — are all around us. Your kid learns to speak because of all the people around her who carry on conversations and because of the TV shows and radio programmes where speaking occurs (as do immigrants like my grandmother, whose English fluency owes much to daytime TV after she came to Canada from Russia).

Google is a case-study in harvesting positive externalities. It offered a free, voice-based directory assistance number, and used the interactions users had with its software to build a corpus of common phrases, expressed in multiple accents and under a wide range of field conditions. Then it used this to train the voice-recognition software that powers its Android-based phone-search. Likewise, it mined all the publicly available translations on the web – EU documents that appeared in multiple languages, fan-based translations for subtitles on cult cartoons, and everything else it could find – and used this to train its automated translation engine, providing it with the context that it needed to figure out the nuance and sense of ambiguous phrases.

He contends that the defining mania of the internet era is

resentment over positive externalities. Many people and companies have concluded that if someone, somewhere, is getting value from their labour, that they should get a cut of that value… Many people have accused Google of “ripping off” the public by indexing content, or analysing it, or both. Jaron Lanier recently accused Google of misappropriating translators’ labour by using online translated documents as a training set for its machine-translation engine – an extreme version of many labour-oriented critiques of online business.

leading to

the infectious idea of internalising externalities turns its victims into grasping, would-be rentiers. You translate a document because you need it in two languages. I come along and use those translations to teach a computer something about context. You tell me I owe you a slice of all the revenue my software generates. That’s just crazy. It’s like saying that someone who figures out how to recycle the rubbish you set out at the kerb should give you a piece of their earnings. Harvesting positive externalities involves collecting billions of minute shreds of residual value – snippets of discarded string –and balling them up into something big and useful.

While I enjoy his take, either he or Lanier has missed the mark. If Lanier’s critique was purely about the Google Translation Toolkit it would be understandable, but as is pointed out in the comments – the EU have made the translations available for exactly that purpose. Similarly, all the Free and Open Source software translation files have been there in the public domain waiting to be harvested since the movement started in the early 1990s – it was just a matter of someone thinking to harvest the files, and having the hardware and technical expertise to do so. And indeed, those files remain open source – someone else is welcome to harvest the same files. Google hasn’t locked them up. The Translation service, on the other hand, asking for translators’ Translation Memories (TMs) and storing them – that is taking other people’s work. I guess the question then becomes whether Google can guarantee that they haven’t used those TMs in their translation service.

Finally, for the real language nerds, Matt Might’s The language of languages is a healthy, if slight, refresher on context-free grammars:

Languages form the terrain of computing.

Programming languages, protocol specifications, query languages, file formats, pattern languages, memory layouts, formal languages, config files, mark-up languages, formatting languages and meta-languages shape the way we compute.

So, what shapes languages?

Grammars do.

Grammars are the language of languages.

Behind every language, there is a grammar that determines its structure.

This article explains grammars and common notations for grammars, such as Backus-Naur Form (BNF), Extended Backus-Naur Form (EBNF) and regular extensions to BNF.

The discussion of context-sensitive grammars and parsing is, to my mind, poorly explained, and the article in general could be more interesting to the non-computer scientist with a little more work. A primer only, really.

One step at a time: Google plays the longest game around

Slashdot has informed me of a HuffPo piece by Found in Translation co-author Nataly Kelly about Google hiring Ray Kurzweil, potentially the world’s most eccentric dork.

The beauty of the web shines through when a commentator can sum it up and extrapolate better than the original post:

You need to investigate the entire initiative Google is spearheading around its acquisition of Metaweb. They are building an ontology for human knowledge, and are ultimately building the semantic networks necessary for creating an inference system capable of human level contextual communication. The old story about the sad state of computers’ contextual capacity, recounts the story of the computer that translates the phrase “The spirit is willing, but the flesh is weak.” from English to Russian and back and what they got was “The wine is good but the meat is rotten.”

The new system won’t have this problem. Because it will instantly know about the reference coming from the Bible. I will also know all the literary links to the phrase, the importance of its use in critical historical conversations, The work of the Saints, the despair of martyrs, in short an entire universe of context will spill out about the phrase and as it takes the conversational lead provided by the enquirer it will dance to deliver the most concise and cogent responses possible. In the same way, It will be able to apprehend the relationship between a core communication given in context ‘A’ and translate that conversation to context ‘B’ in a meaningful way.

Ray is a genius for boiling complex problems down into tractable solution sets. Combine Ray’s genius with the semantic toy shop that Google has assembled, and the informational framework for an autonomous intellect will become. The real question is how you make something like that self aware. There’s a another famous story about Helen Keller, before she had language. symbolic reference, she lived like an animal. Literally a bundle of emotions and instincts. One moment, one utterly earth shattering moment there was nothing, then Annie Sullivan her teacher placed her hand in a stream of cold water and signed water in her palm. Ellen understood… water. In the next moment Ellen was born as a distinct and conscious being, she learned that she had a name, that she was. I don’t know what that moment will look like for machines, I just know its coming sooner than we think. I also can’t be certain whether it will be humanities greatest achievement or our worst mistake. That awaits seeing.

The Endangered Languages Project

I’m sure I’ve posted about this mob before, or a very similar project – unfortunately I’m not in a position to search through my posts to find where I might have, so in the meantime: The Endangered Languages Project.

Google has had a role in developing this project and has a press release up now:

The Endangered Languages Project, backed by a new coalition, the Alliance for Linguistic Diversity, gives those interested in preserving languages a place to store and access research, share advice and build collaborations. People can share their knowledge and research directly through the site and help keep the content up-to-date. A diverse group of collaborators have already begun to contribute content ranging from 18th-century manuscripts to modern teaching tools like video and audio language samples and knowledge-sharing articles. Members of the Advisory Committee have also provided guidance, helping shape the site and ensure that it addresses the interests and needs of language communities.

Google has played a role in the development and launch of this project, but the long-term goal is for true experts in the field of language preservation to take the lead. As such, in a few months we’ll officially be handing over the reins to the First Peoples’ Cultural Council (FPCC) and The Institute for Language Information and Technology(The LINGUIST List) at Eastern Michigan University. FPCC will take on the role of Advisory Committee Chair, leading outreach and strategy for the project. The LINGUIST List will become the Technical Lead. Both organizations will work in coordination with the Advisory Committee.

As part of this project, research about the world’s most threatened languages is being shared by the Catalogue of Endangered Languages (ELCat), led by teams at the University of Hawai’i at Manoa and Eastern Michigan University, with funding provided by the National Science Foundation. Work on ELCat has only just begun, and we’re sharing it through our site so that feedback from language communities and scholars can be incorporated to update our knowledge about the world’s most at-risk languages.

Building upon other efforts to preserve and promote culture online, has seeded this project’s development. We invite interested organizations to join the effort. By bridging independent efforts from around the world we hope to make an important advancement in confronting language endangerment. This project’s future will be decided by those inspired to join this collaborative effort for language preservation. We hope you’ll join us.

Uncanny Valley

The Uncanny Valley refers to a theory that at some point in the development of robots and CGI, just when they reach “almost but not quite” human replication, humans will react with revulsion rather than recognition. It’s a term that has been in greater focus over the last half decade as more and more robots are developed and CGI keeps improving.

Well, now the original essay, written by Japanese roboticist Masahiro Mori, has been officially translated.


Famous ‘Uncanny Valley’ Essay Translated, Published In Full

Updated LibreOffice

I discovered that the premier free office software, LibreOffice, was updated to version 3.5 recently. For those working with language, amongst the new features and fixes are some localisation improvements that justify an upgrade. If you are paying for a competing office suite, I recommend you try LibreOffice before spending the money at your next upgrade opportunity.


  • Added Arabic, Aragonese, Belarusian, Bengali, Breton, Bulgarian, Scottish Gaelic, Greek, Gujarati, Hindi, Latvian, Brazilian Portuguese, European Portuguese, Sinhala, and Telugu spelling dictionaries. (Andras Timar)
  • Use of possessive genitive case and/or partitive month names if provided by a locale’s locale data (e.g., Russian, Polish, Finnish, Lithuanian, and others).
    If a day of month (D or DD) is present in a number formatter’s date format code, the month name for MMM or MMMM is displayed in possessive genitive case or partitive case.
    Else if no day of month is present, the month name is displayed as noun / nominative case.
    See blog for more details. (Eike Rathke)
  • Corrections to Polish [pl-PL], Portuguese [pt-PT and pt-BR], Slovenian [sl-SI], and Latin [la-VA] locale data, esp. date formats. (Eike Rathke, Martin Srebotnjak, Mateusz Zasuwik, Olivier Hallot, Roman Eisele, Sérgio Marques)
  • Initial support for two new UI languages, Luxembourgish (lb) and Tatar (tt) 
    LibreOffice 3.5 supports 107 UI languages.
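The month-name rule in those notes is easier to see with an example. Below is a rough Python sketch of the rule as I read it, not LibreOffice’s actual code: the function name month_name and the two-month Russian table are my own illustrative assumptions, and it ignores details such as quoted literal text inside a format code.

```python
import re

# Hypothetical mini locale table: month number -> (nominative, genitive).
MONTHS_RU = {
    1: ("январь", "января"),
    5: ("май", "мая"),
}

# D or DD is a day of month. Longer runs of D are skipped, on the
# assumption that they stand for day names rather than day numbers.
DAY_OF_MONTH = re.compile(r"(?<!D)D{1,2}(?!D)")


def month_name(format_code, month):
    """Pick the case of the month name based on the date format code."""
    nominative, genitive = MONTHS_RU[month]
    if DAY_OF_MONTH.search(format_code):
        # A day number is present: "1 января", not "1 январь".
        return genitive
    # No day number: the month stands alone as a plain noun.
    return nominative


print(month_name("D MMMM YYYY", 1))  # day present, so genitive
print(month_name("MMMM YYYY", 1))    # no day, so nominative
```

So a format like D MMMM YYYY picks the genitive form, while MMMM YYYY on its own stays nominative.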

Useful English words

Reddit, always a source of entertaining group intelligence, has been asked: ESL redditors, what’s a really useful English word that you don’t have in your native language? And they have responded with the usual gusto, listing many, and being corrected when wrong.

Of course, then there are the more, shall we say, informative answers. I’ll put a strong language themes warning here, but there’s a lot to be learnt:

The Stupid and/or Mythical: Ronald Reagan laughed at the Russians because their language didn’t even have a word for ‘detente’.

The Profane then educative: Fuck. Which leads to the claim and then counter claim about being the most useful word. The counter claim is fascinating, introducing me to the previously unknown Chinese poet Yuen Ren Chao and his amazing poem Lion-Eating Poet in the Stone Den, which I will present here in its entirety:

Shī Shì shí shī shǐ

Shíshì shīshì Shī Shì, shì shī, shì shí shí shī. Shì shíshí shì shì shì shī. Shí shí, shì shí shī shì shì. Shì shí, shì Shī Shì shì shì. Shì shì shì shí shī, shì shǐ shì, shǐ shì shí shī shìshì. Shì shí shì shí shī shī, shì shíshì. Shíshì shī, Shì shǐ shì shì shíshì. Shíshì shì, Shì shǐ shì shí shì shí shī. Shí shí, shǐ shí shì shí shī, shí shí shí shī shī. Shì shì shì shì.

In English:

Lion-Eating Poet in the Stone Den
In a stone den was a poet called Shi, who was a lion addict, and had resolved to eat ten lions.
He often went to the market to look for lions.
At ten o’clock, ten lions had just arrived at the market.
At that time, Shi had just arrived at the market.
He saw those ten lions, and using his trusty arrows, caused the ten lions to die.
He brought the corpses of the ten lions to the stone den.
The stone den was damp. He asked his servants to wipe it.
After the stone den was wiped, he tried to eat those ten lions.
When he ate, he realized that these ten lions were in fact ten stone lion corpses.
Try to explain this matter.

You can hear Google pronounce it here

The French double entendres: Not having a word for sibling (interesting) then leads to the claim that they have mutant words for the numbers 70-99 and the obvious “too distracted after 69” joke.

The where-else-but-reddit: Apparently, the English ability to verbify other words, e.g. “scienced” or “googled”, is somewhat unusual, and missed, which leads to some interesting mutations and silliness.

And finally, The weirdest of afflictions: the Spanish have no word for moist, which others (and I’ve met one once) consider to be the nastiest word in English, and is apparently one of the main triggers for those with the affliction known as Word Aversion:

But there are a few words that, very often, make me sick to my stomach, and, it turns out, I’m not the only one. This is, I’ve learned, just part of language and is known as “word aversion.” It’s not like word rage, which occurs when you hate a word or phrase because of its associations with a particular group of people or trend, (“bromance,” “Twi-hard”), because people often use it incorrectly, (“your/you’re”) or because you think it’s pretentious, (“nomenclature,” “obtuse,” “pretentious”). Word aversion has nothing to do with meaning and is all about the actual word. Word aversion is, according to Language Log, “…bred of the mysterious relationships between language, emotion, memory, sound and mouthfeel.” (Sidebar: “Mouthfeel” is just an awful, awful word. Why would anyone include “mouthfeel” in an essay about word aversion?)