Words and language – the world has changed

It’s been a while between posts, but I have recently come across a number of language-related pieces that I think are worth sharing.

For a while there, lists of hot new words were all over the internet. That’s slowed down significantly, but I found two recently that deserve a look. I particularly liked the list of 216 non-English words “referring to emotional states from the world’s languages that have no correlate in English”. I think what I like most about this list – hell, it’s the reason anyone finds it interesting – is that the emotional states referred to are states we can all empathize with, and because of that lack of correlate, the translations sound like poetry.

* Aware (哀れ) (Japanese): the bittersweetness of a brief, fading moment of transcendent beauty.
* Sabi (侘寂) (Japanese): aged beauty.
* Mono no aware (物の哀れ) (Japanese): pathos of understanding the transiency of the world and its beauty.

哀れ, pronounced a-wa-rey (I’m not a phonetician, sorry), is something that I feel regularly – reflections in puddles, the graceful elderly (侘寂), my partner singing quietly while cooking. I’m also fascinated by the racial profiling I give these words – do I find these words interesting

* Dadirri (Australian Aboriginal): a deep, spiritual act of reflective and respectful listening.
* Koselig (Norwegian): cosy, warm, intimate, enjoyable.
* Mbuki-mvuki (Bantu): to shed clothes to dance uninhibited.
* On (恩) (Japanese): a feeling of moral indebtedness, relating to a favour or blessing given by others.
* Peiskos (Norwegian): sitting in front of a crackling fireplace enjoying the warmth.

because of what they convey or because they seem to be culturally perfect in a stereotypical understanding of their respective cultures?

* Cafune (Portuguese): tenderly running one’s fingers through a loved one’s hair.
* Desenrascanço (Portuguese): to artfully disentangle oneself from a troublesome situation.
* Estrenar (Spanish): to use or wear something for the first time.
* Fernweh (German): the ‘call of faraway places,’ homesickness for the unknown.
* Fingerspitzengefühl (German): ‘fingertip feeling,’ the ability to act with tact and sensitivity.
* Gjensynsglede (Norwegian): (noun) The joy of meeting someone you haven’t seen in a long time.
* Guān xì (關係) (Chinese): building up good social karma.
* Janteloven (Norwegian/Danish): a set of rules which discourages individualism in communities.
* Jugaad (जुगाड) (Hindi): the ability to ‘make do’ or ‘get by’.
* Kvell (Yiddish): to feel pride and joy in someone else’s accomplishment.
* Tîeow (เที่ยว) (Thai): to roam around in a carefree way.
* Ubuntu (Nguni Bantu): being kind to others on account of one’s common humanity.

The Dictionary of Fantastic Vocabulary, on the other hand, is a list of completely made-up words. By the look of it, the words have been created programmatically (ok, proof: look at the end of the section for the letter E/e, just up from H) and meanings have been applied later. The beauty here is the recognition that the words don’t exist for a reason – very few of them are easy to say, and they *look* clunky. But you could imagine, over time, being able to introduce some into everyday usage. The idea is greater than the execution, but I think it’s a noble failure. I really should do some analysis on the distribution across the alphabet… dammit, I just went and did it. As you can see, there are only 15 letters represented at all, and just over half the words start with E, A or S. I presume this is a combination of prevalence in English, prevalence of prefixes starting with those letters, and the author’s internalized biases.

Letter  Words  Percent of total
E         334  22.03
A         284  18.73
S         148   9.76
D         142   9.37
I         127   8.38
O         109   7.19
C         108   7.12
P          78   5.15
U          53   3.50
H          52   3.43
R          20   1.32
M          19   1.25
T          16   1.06
B          15   0.99
N          11   0.73
Total    1516  100.00
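
The percentages can be re-derived from the counts with a few lines of Python. This is only a sketch: the counts are copied from the table, and the names `counts` and `letter_shares` are made up for illustration.

```python
# Number of words per initial letter, copied from the table.
counts = {
    "E": 334, "A": 284, "S": 148, "D": 142, "I": 127,
    "O": 109, "C": 108, "P": 78, "U": 53, "H": 52,
    "R": 20, "M": 19, "T": 16, "B": 15, "N": 11,
}


def letter_shares(counts):
    """Return {letter: percent of total}, rounded to two decimals."""
    total = sum(counts.values())
    return {letter: round(100 * n / total, 2) for letter, n in counts.items()}


shares = letter_shares(counts)

# Fraction of all words that start with E, A or S.
top_three = sum(counts[letter] for letter in "EAS") / sum(counts.values())

for letter, pct in shares.items():
    print(f"{letter} {counts[letter]:>4} {pct:5.2f}")
print(f"E, A and S together: {top_three:.1%}")
```

Running it reproduces the percentage column and confirms that E, A and S together only just clear the halfway mark.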

The final explicitly word-based interest is Helen Zaltzman’s The Allusionist podcast – not only is it a great short podcast about words, the latest episode is actually about dictionary.com’s word of the day, a mailout with 13 million subscribers. People really do love words.

The final post is an interesting linguistic article I stumbled across, titled Your Ability to Can Even: A Defense of Internet Linguistics, that starts with “I can’t even” and a friend’s recent claim “I have lost all ability to can” as a riff on the former:

Loose translation: “This link is so amazing that I have lost my ability to express my appreciation for it in fully formed sentences. All speech has been reduced to this ill-formed sentence. Thus is the depth of my excitement about this. Click on it. Click on it if you too would like to experience this level of incoherent excitement.”

While the article doesn’t address the contemporary obsession with communication via emoji (and its sometimes miscommunication, depending on OS) it does address the new field of “internet linguistics” and a new wave of conservative backlash from those that would have language stagnate. There is also some great gender analysis of the roles played in language creation:

In short, this dialect results when people who already share a language are given new tools. The result isn’t a butchering of English language but a creative experiment with it. Am I claiming that the Internet as a whole is operating on a level of postmodernism that would make Joseph Heller, Kurt Vonnegut and Thomas Pynchon seem like novices? maybe i am maybe im not u punk wut of it like who r u to tell me otherwise

Dr. Tannen does the interesting work of examining gender and tech language. In studying sample text messages, she found that women were much more likely to use enthusiasm markers like exclamation points and add emphasis via capitalization. Most linguists emphasize the lack of understanding that can take place between men and women as a result of the different value that each gender places on conveying emotions. Supposedly, women perceive men’s lack of enthusiasm markers and capitalization as coldness and men perceive women’s use of them to be unnecessary.

However, what I find most fascinating about the Internet Language is that it is making language less, not more, gendered. Men and women on the Internet use many of the same tropes, enthusiasm markers and emphasizers in order to communicate. In the world of blogging and Internet writing, women are the creators of language. It is a realm in which women are not being socialized with already existing language but are doing the work of socializing and creating a community. Women dominate every important social media platform. Women outnumber men on Facebook, Twitter and Pinterest and account for 72% of all social media users. On Tumblr, where the number of men and women is roughly equal, women dominate the conversation.

Short Cuts, the end of 2011

With the new year having started, I’ve been too busy child minding to get anything solid on the blog recently, but here are the stories that have piqued my interest:

  • The Economist has The gift of tongues, a review of Babel No More: The Search for the World’s Most Extraordinary Language Learners by Michael Erard (buy at Amazon), about hyperpolyglots. Wikipedia describes a hyperpolyglot as someone who speaks six or more languages; Erard pushes this out to eleven or more. Interesting points are made about the need to “prime” weaker languages in one’s repertoire before speaking in them – phase shifting has always been difficult, as translators and interpreters know well. It’s a short, interesting article, although one doesn’t get the feeling that the book has much more information. (edit: Unfortunately I’ve not the time or resources to read the book itself, but Michael has contacted me to say that there is much more to the book than the Economist article addresses. Of course, if I hadn’t been child minding, a 30 second google search would have led me to the book’s website, babelnomore.com, which I have just found, and where you can find out more about the book)
  • A Slashdot post called New Online Dictionaries Automate Away the Linguistic Middleman pointed me to an interesting article about potential future dictionaries. Instead of having an edited, or pruned, list of definitions, these services – two are listed, Wordnik and The Corpus of Contemporary American English – show the word you search for in a number of contexts: the dictionary definition, examples of how it’s been used recently, the word’s collocates (words usually found near the original word), related words, lists the word may belong to, tweets containing the word, sounds and visuals. While I don’t think they replace the dictionary, these services are certainly a fascinating progression from what we grew up with.
  • I can’t remember how I got there, but the 24 Ways blog has a great article, Creating Custom Font Stacks with Unicode-Range. It’s a tutorial on how to use different fonts on the same page within HTML5 – an easy and light way to change the font for numerals or caps, for instance. Using the original Use the Best Available Ampersand as a case study is fantastic – a solid use case with a simpler implementation.
  • For those still hanging on to the Academy, Slashdot makes some devastating points in When Getting Rid of College Lectures Makes Sense, but let’s cut to the chase:

    Joe Redish also teaches physics, at the University of Maryland, and says, ‘With modern technology, if all there is is lectures, we don’t need faculty to do it. … Get ’em to do it once, put it on the Web, and fire the faculty.’

    (emphasis mine). I’ve said similar before, and I believe it. Classes need to be turned around – all the lecture watching should be happening at home – class time is for discussion, problem solving and example exploration.

New issue of Translating and Interpreting

Ignacio Garcia has just sent out the latest issue of The International Journal for Translating and Interpreting Research, including articles on speech recognition in translator training (Dragsted, Mees & Hansen), translation memory-mediated environments (Mesa-Lao), legal interpreters in Ireland (Phelan) and a quantitative study on clear English for Translation (Burns and Kim). Personally, I’m most excited about Pym’s What technology does to translating:

Abstract: The relation between technology and translating is part of the wider question of what technology does to language. It is now a key question because new translation technologies such as translation memories, data-based machine translation, and collaborative translation management systems, far from being merely added tools, are altering the very nature of the translator’s cognitive activity, social relations, and professional standing. Here we argue that technologies first affect memory capacity in such a way that the paradigmatic is imposed more frequently on the syntagmatic. It follows that the translating activity is enhanced in its generative moment, yet potentially retarded in the moment of selection, where the values of intuition and text flow become difficult to recuperate. The redeeming grace of new technologies may nevertheless lie in new modes of opening translation to the space of volunteer translation, where humanizing dialogue can enter the internal dimension of translation decisions. The regime of the paradigmatic may thus be embedded in new modes of social exchange, where translation becomes one of the five basic language skills.

My main interest, of course, is in asking those questions that Pym potentially doesn’t consider. I mean this without malice, but I do feel that the academy has carefully cotton-wooled itself from the more interesting ideas that have come from the last century. In particular, I’d be looking to ask questions like how does détournement affect his “humanizing dialogue” – for instance, in the realms of crowd-sourced illicit subtitling.

I’ve yet to read the article, but am looking forward to it.

Short Cuts

Quite a few things this week. Given that my schedule has busied up, the holiday season has left us and people are back in the swing of things, that’s not so surprising, I guess.

1. Slashdot is reporting that there is now a commercially available device specifically designed to help health professionals diagnose, analyse and communicate with patients when there is a language barrier between them. The Phazer is not cheap at 12-18 thousand dollars, but has a 7″ screen that patients can point at should that be necessary and “can hold over 300 languages at any one time and, after it identifies the patient’s native tongue, gathers the necessary background information using pre-recorded videos of doctors speaking in the patient’s own language.”

2. Kaggle is a relatively new website that runs very interesting competitions (for want of a better word) that relate to big data – large sets of information with potentially complex analysis required over many variables. Previously, the Netflix Prize had been a one-off to improve the “if you liked X, you may like Y” algorithm at Netflix (an online movie distributor) – Kaggle have taken this idea to the next level by creating a space for organisations, businesses and researchers without big data experts.

One of their newest competitions requires “participants to develop an algorithm to identify who wrote which documents.” Which, in layman’s terms, equates to handwriting recognition:

“Writer identification is important for forensic analysis, helping experts to deliberate on the authenticity of documents. This competition aims to further the science of writer identification. It requires participants develop algorithms that can identify handwriting. This is a difficult problem because a writer never reproduces exactly the same characters.

Writer identification generally requires two steps. The first is an image-processing step, where features are extracted from the images. The second step is a classification step, where the document is assigned to the “closest” document in the dataset according to the “difference” between their features.

In this contest, a previously unpublished data has been made available, containing the writings of more than 50 writers. Participants are asked to provide a similarity score, showing how probable it is that two documents are written by the same person. For participants who are not familiar with image-processing, a set of geometrical features extracted have been provided.”
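
The second step in the quoted description, assigning a document to the “closest” known document, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the feature vectors are made up, Euclidean distance stands in for the “difference” between features, and the `similarity` score is one arbitrary way to turn a distance into a number between 0 and 1 (it is not a true probability). None of this is taken from the competition itself.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_writer(features, known_documents):
    """Return the writer of the known document nearest to `features`."""
    writer, _ = min(known_documents, key=lambda doc: distance(features, doc[1]))
    return writer


def similarity(a, b):
    """Map a distance into (0, 1]; 1.0 means the features are identical."""
    return 1.0 / (1.0 + distance(a, b))


# Made-up feature vectors standing in for the output of the image-processing step.
known_documents = [
    ("writer_a", [0.9, 0.1, 0.4]),
    ("writer_b", [0.2, 0.8, 0.5]),
]
unknown = [0.85, 0.15, 0.35]

print(closest_writer(unknown, known_documents))
print(round(similarity(unknown, known_documents[0][1]), 3))
```

A real entry would need far richer features and a properly calibrated score, but the shape of the problem is the same: extract features, then compare.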

While I understand that I probably don’t have a lot of big data obsessed readers, I think Kaggle’s got a great idea. For example, if my local Metro service can’t afford to get enough analysis to get its damn timetabling elegant, maybe they could give the data and constraints over to Kaggle, for free or a small prize, and someone else could solve that problem for them? Or, for instance – how can we get better Machine Translations? I’ll post the results when the competition is over.

3. Quite a few podcasts have focused on language recently, and I thought I’d link through to some.

The 99% Invisible podcast, which is largely about design, recently focused on Esperanto and the *design* decisions behind it, which I thought was an interesting take – I’ve not thought much about how it was designed, nor have I seen much analysis in this regard. Maybe not new to some, but a great five minute intro in relation to other invented languages. Esperanto, unlike other created languages to that point, occupies a magic middle ground of specificity and arbitrary sloppiness that makes it much more natural than those that came before. There’s a lovely little anecdote about war games the US Army would play during the Cold War – wanting the “other” to speak a foreign language, but not wanting to insult any racial or language group, they used Esperanto, to keep the peace.

4. Radiolab’s episode on Words from August last year. Radiolab’s podcast follows a similar idea to This American Life – an hour made up of smaller vignettes. In one story we learn of a woman who learns sign language after an accident leaves her speechless, and she goes on to become an early sign-interpreter. She meets a 27-year-old deaf man who doesn’t understand the concept of language – he didn’t know about sound – didn’t know he was deaf, or that deafness set him apart from the rest of humanity. She taught him words – it’s quite a beautiful story. The podcast goes on to talk about language development in children and the creation of language – going back to Shakespeare’s prolific creations that still exist today, words and phrases that were previously unheard – then on to a neurologist who recognises her own stroke and discusses the role language plays in the brain, and what she experienced (from a researcher’s perspective) when she had her stroke. There’s a TED video about her that I urge you to watch – she’s engaging and it’s quite an unusual and interesting talk. Finally, the podcast focuses on sign language in Nicaragua in the 70s – a country that had no sign language. Children and their families and communities would create their own sign language to get by – until the dictator’s wife makes a school for the deaf – and these children create their own sign language organically. Its development over time is particularly fascinating by virtue of the *second* generation of users – the first generation lacked words for concepts like “thinking” and “believe”, “forget” and “know” – they were so busy building the fundamentals. It’s the second generation that create this part of the language – and it leads to some very surprising results.

5. Finally, Jost Zetzsche (whose Translator’s Toolkit was required reading until the free version became so short as to be empty) started a thread on the OmegaT list looking for a Free Software vs Commercial software stoush – and not really finding one – that led to some interesting links: Pirate Pad for collaborative writing over the internet, with each user being assigned a colour to make it easier to see who wrote what; and Gobby, which does something similar but also includes a chat function, and seems to be locally installed software rather than an online service.



Short Cuts

Another round of recent language and translation shorts:

  • French slang word of the day: “Yaourt”:

    [‘Yaourt’ (“Yoghurt”)] is the word used to describe the practice of singing along to tracks in English, usually with an unconvincing American accent, when you have absolutely no idea of the words.

  • From the same blog there is also 20 obsolete English words that should make a comeback.

    (via acb)

  • Again, from my language (obsessed, it would seem) workmate, I learn about the linguistic concept of
    False friends:

    (French: faux amis) are pairs of words or phrases in two languages or dialects (or letters in two alphabets) that look or sound similar, but differ in meaning.
    Comedy sometimes includes puns on false friends, which are considered particularly amusing if one of the two words is obscene; when an obscene meaning is produced in these circumstances, it is called cacemphaton (Greek for “ill-sounding”).

    (for example) “Egregious” means “outstandingly bad” in English whereas in Spanish “egregio” means “outstanding in a positive way”. The original word simply meant “outstanding from the group” (related to “gregarious”) but the meaning was narrowed down in both languages with opposite meanings.

  • I studied three years of mathematics at University and have always enjoyed and had a head for numbers. I’ve previously posted about the way that Maths, Logic, language and computing are intertwined more than most would realise – Hofstadter’s Gödel, Escher, Bach: An Eternal Golden Braid makes the interesting link to art as a further example of the interconnection (networked-ness?) of the disciplines. I ended up on the Futility Closet blog recently chasing down this fascinating, perfect magic square – and the blog turns out to be quite wonderful! There is an excellent Language section with word play, word of the day and other tricks and turns, including lots of number play. My favourite so far?

    adj. one who gives opinions and advice on topics beyond his knowledge

  • Ignacio Garcia, editor of the international journal Translation and Interpreting has just told me about a Professional development course being run by The Institute of Localisation Professionals – the information can be found here and the course content can be found here.
  • There is a word in the English language, callipygous, that means “shapely, beautiful buttocks”.

Short Cuts

Another short cuts instalment for your pleasure.

  • The economic case for open access in academic publishing is an interesting piece on exactly what it says. A little heavy on the economics, but the conclusion is interesting:

    Ultimately, I believe the academic publishing world will, and should, slowly shift toward open access, but the transition will be ugly. The issue boils down to a classic problem in economics: the tragedy of the commons. While the publishing industry and researchers continue to act in their own short-term self-interest by continuing the status quo, we are slowly heading toward an untenable situation where the people producing research papers will not be able to afford to access them.

  • When you work with nerds, you find out about the most amazing things. One of my colleagues recently made an off-hand remark about a term’s rime in relation to its syllable. Being relatively new to the atomic structures of language, I found it fascinating to know that syllables had been broken down – the onset and the rime, the rime being made up of a nucleus and optionally, a coda:

    The term rime covers the nucleus plus coda. In the one-syllable English word cat, the nucleus is a (the sound that can be shouted or sung on its own), the onset c, the coda t, and the rime at. This syllable can be abstracted as a consonant-vowel-consonant syllable, abbreviated CVC.

    I also had the pleasure of learning from Dr John McWhorter that the English language’s use of ‘a’ and ‘the’ is almost unique and considered quite odd by other cultures.

  • I’ve also only recently discovered that there is a language called Lojban (“The Logical Language“) that has been created to be non-ambiguous:

    Lojban has a number of features which make it unique:

    – Lojban is designed to be used by people in communication with each other, and possibly in the future with computers.
    – Lojban is designed to be culturally neutral.
    – Lojban has an unambiguous grammar, which is based on the principles of logic.
    – Lojban has phonetic spelling, and unambiguous resolution of sounds into words.
    – Lojban is simple compared to natural languages; it is easy to learn.
    – Lojban’s 1300 root words can be easily combined to form a vocabulary of millions of words.
    – Lojban is regular; the rules of the language are without exception.
    – Lojban attempts to remove restrictions on creative and clear thought and communication.
    – Lojban has a variety of uses, ranging from the creative to the scientific, from the theoretical to the practical.

    While it is interesting in concept, the fact that it has been “built over five decades by dozens of workers and hundreds of supporters” and is yet to gain widespread usage leads me to believe that it will remain an intellectual curiosity, like Esperanto, for a while to come.

  • Although it may potentially be useful in the Swiss village of Bivio – the 200 residents speak 3 languages and several dialects of each, with classes in the local school alternating languages on other days and each of the local churches sermonising in a different language.
  • The Culturally Authentic Pictorial Lexicon is looking for funding and contributors in general. They use Creative Commons images in language learning situations:

    Images do improve vocabulary acquisition and are essential for the instruction of culture. Simply put, access to good media is limited for teachers. Sure, we all have MS Clipart, but how do you explain a ticket cancelling machine in the unit on public transit with clipart? Or how do you convey what a Döner is? An image is a good place to start. Simply taking images from search engines doesn’t always work. There can be too many and finding the right image is often hard when you want to convey a specific cultural idea in a different language.

    My long term hope is that at some point there will be a truly authentic visual dictionary that is multi-lingual and authentic. Current search engines don’t have an elegant way to sift through visual content with a cultural filter. Perhaps it could be done with geo-tagging in combination with meta-data, but for now our project does it by hand. We are trudging along with our group of volunteer experts who edit the images we have in our database. It is a type of “slow media” project. As you can see in our database list, there are some starter projects just getting off the ground with less than 1000 unique entries, and we have other languages that have many more images in the database.

  • This link is more like a bookmark if anything, but I thought others might appreciate it. My current employer Monash University is going to be moving its in-house class content delivery system from the woeful Blackboard system (woeful) now in place to Moodle in 2012. I found an interesting review of a book that looks at using Moodle to teach a second language – and from all accounts, it can be used quite successfully.
  • I have also learnt this week that the line of numbers on a book’s copyright page is called the printer’s key and actually has a purpose – it’s used to indicate the print run of a book.

Short cuts

As you can see, I’m still struggling to keep up with all the news that’s flying around. I thought I’d share a collection of links that have swum before my eyes over the last fortnight that can only really be classed as entertainment.

There have also been other more serious posts:

  • Metaglossia links to a new book on translating and interpreting for Social Activism. I hadn’t heard of Metaglossia before, and I had higher hopes for it. Unfortunately it seems to be not much more than a collection of press releases. I’m fascinated by the topic of the book, however, and know many people who potentially could learn from it – but is it worth it? As far as I can see, it’s just a collection of essays – I’d be interested to hear the thoughts of those that have read it regardless.
  • I don’t think enough people make the connections when it comes to search, MT and grammar analysis, but I do make those connections – Slashdot is reporting on some new Plagiarism in Research software. As with many Slashdot postings, it seems to be under-analysed before posting – “search and destroy” is overly aggressive language regarding plagiarism in my book, but I would say that: I loved that the Situationists plagiarised Isidore Ducasse.
  • TAUS has a fantastic pictorial history of Machine Translation: A Translation Automation Timeline. It’s quite nerdy, but shows quite well the growth in the industry over time – starting in 1945, it largely focuses on researchers, by 2010 most facts revolve around corporate involvement.
  • At some point I was pointed to Moses for Mere Mortals – I’ve talked about the machine translation engine Moses in previous posts, in particular how difficult it is for the non-technical to set up. This package contains both Windows and Ubuntu packages that will make testing Moses a lot easier for the non-technical. It’s also handy for getting up to speed on just what is required in the MT toolchain – given that most people are put off by the set up alone, the last few yards (using corpora for MT training, for instance) are often missed or under-analysed. I’ve not tried it yet myself, but you can consider this post a TODO.
  • I’ve never read her novels, but Zadie Smith writes wonderfully. I recently saw The Social Network and her analysis was interesting reading: Generation Why – it’s that kind of film. And that led me to a fascinating article she wrote on the voice just after Obama was elected – Speaking in Tongues:

    My own childhood had been the story of this and that combined, of the synthesis of disparate things. It never occurred to me that I was leaving the London district of Willesden for Cambridge. I thought I was adding Cambridge to Willesden, this new way of talking to that old way. Adding a new kind of knowledge to a different kind I already had. And for a while, that’s how it was: at home, during the holidays, I spoke with my old voice, and in the old voice seemed to feel and speak things that I couldn’t express in college, and vice versa. I felt a sort of wonder at the flexibility of the thing. Like being alive twice.

    But flexibility is something that requires work if it is to be maintained. Recently my double voice has deserted me for a single one, reflecting the smaller world into which my work has led me. Willesden was a big, colorful, working-class sea; Cambridge was a smaller, posher pond, and almost univocal; the literary world is a puddle. This voice I picked up along the way is no longer an exotic garment I put on like a college gown whenever I choose—now it is my only voice, whether I want it or not. I regret it; I should have kept both voices alive in my mouth. They were both a part of me. But how the culture warns against it! As George Bernard Shaw delicately put it in his preface to the play Pygmalion, “many thousands of [British] men and women…have sloughed off their native dialects and acquired a new tongue.”

And I think I will leave you with that for tonight. I’ve just been conversing with a wonderful academic from Indonesia who has given me permission to re-print her essay on the mutability of Bahasa Indonesia – look forward to it.