The internet kids are all right

I’m an outsider in the academy. There, I said it publicly. I don’t think that the institution I work for, or the milieux I’ve landed in, really understand the situation that faces us. Admittedly my government (and many others, I assume) demands publishing as a measure of success, but this industry doesn’t have time for quarterly dead-wood journals. The Internet is a here-and-now medium, and as translators and interpreters use more and more technology and online communication, the T&I field needs to keep up, regardless of what’s acceptable in other academic fields. Recently there have been reports about how hard it is for the young to learn “like we used to when I was a boy” due to the immediacy of the Internet.

Don Tapscott has written an enlightening and validating article in the HuffPo that assures me that even if my colleagues don’t quite understand what’s going on, the students I’m teaching do.

Publishing old-stylee be damned. One of the first things I realised when researching my then-new post at Monash University was that the “literature” was woefully out of date. Even the most competent and knowledgeable academics in the translation field (respect: Garcia, Pym) were/are still publishing in the old media. It just doesn’t cut it. The dialogue needs to move to the places where it will be read, where it will be found.

Four months ago I was invited to present at a conference on Technology and Translation, and I was pleased as punch. I had a great time and met some amazing people – that’s what conferences are about, right? But when it came to writing something up and becoming a “published” academic, while it would have helped my career as far as the faculty were concerned, it just didn’t sit right. It didn’t fit into my schedule, and it didn’t gel with what I was teaching or what I understood of the way students are learning. It’s one of the reasons I started this blog – here I can draw links, make notes, and highlight what is happening in Translation and Technology. The subject is moving too fast and furiously to wait for a journal to accept a paper and print it six months later – by then it’s old news.

I accept that my writing isn’t as amazing or thought-provoking as a published article in the traditional sense, nor does it have the same intellectual rigour, but by writing here I get better at writing itself, and in the meantime I stay contemporary – something my more eloquent peers will miss out on if they are not careful.

The article I link to above is instructive – like Vishad, I don’t need the academy. I’m a consultant and sysadmin who can make a living doing those things; I’ve no need for a PhD, the histrionics surrounding one, or a career in the academy.

If you are looking for somewhere easy to start on this topic, I recommend Steven Johnson’s Everything Bad Is Good for You. I read his Emergence: The Connected Lives of Ants, Brains, Cities, and Software way back in the early noughties, which, along with Bey’s Temporary Autonomous Zone, greatly influenced how I think today.

Exploring new publishing formats…

I’ve spoken about Cory Doctorow before – he is the foremost advocate for the new online publishing paradigm. And I’ll keep talking about him while he’s experimenting with online media. Recently he blogged about all the free translations of his books. The books are released under a Creative Commons licence, so they are free for all to translate into their language of choice, as long as the translation adheres to the CC licence.

This is smart internets. Facebook famously and successfully translated their interface in a similar manner for a reason – it works. And as Cory riffs repeatedly (originally in 2006):

That’s because my biggest threat as an author isn’t piracy, it’s obscurity. The majority of ideal readers who fail to buy my book will do so because they never heard of it, not because someone gave them a free electronic copy.

I think translators should be taking note of this concept. And now you can see the quality of some translators via their translations of Cory’s works. It’s essentially the same as free and open source software (and, let’s face it, a lot of great punk bands) – get known by virtue of your work; it’s the easiest way to prove yourself.

Wordfast free online TM!

Wordfast is a TEnT that I’ve had education licences for this year, but we never used them in class – we spent the whole semester struggling to get Trados working on our system, and asking even more of the IT team seemed a stretch. If the course is still taught next year, I’ll try to get it installed.

The thing that struck me as strange about Wordfast was that, in this day and age, I needed to send snail mail on the University’s letterhead, signed by the head of school, to get the (admittedly free) Education licences. I admit to dismissing them slightly, although I was tickled at having my first opportunity to send paper mail to France.

Well, now they have announced FreeTM.com – basically a TEnT online. Gone are the days of needing to be near your home computer or laptop to get work done. Translators can upload a million TUs in their TMs, the TMs are not shared with anyone, and the Terms of Use are *wonderfully* succinct and understandable. And it really is free – you just register/sign in and off you go. There are full backup and download facilities to make up for the 10-document/million-TU limits, which somewhat placates my concerns about how long it will last and how long it will remain free.

I’m not a translator, but it looks very useful. I’d love to know what its shortcomings are – I presume its functionality is stripped back compared to a full suite like OmegaT or Trados?

I also discovered yesterday that if you want a free licence for the desktop version of Wordfast, you only need to translate their Wikipedia entry – obviously an opportunity that’s only available if the page hasn’t already been translated into your L1/L2.

Short cuts

As you can see, I’m still struggling to keep up with all the news that’s flying around. I thought I’d share a collection of links that have swum before my eyes over the last fortnight that can only really be classed as entertainment.

There have also been other more serious posts:

  • Metaglossia links to a new book on translating and interpreting for social activism. I hadn’t heard of Metaglossia before, and I had high hopes for it; unfortunately it seems to be not much more than a collection of press releases. I’m fascinated by the topic of the book, however, and know many people who could potentially learn from it – but is it worth it? As far as I can see, it’s just a collection of essays – I’d be interested to hear the thoughts of those who have read it regardless.
  • I don’t think enough people make the connections between search, MT and grammar analysis, but I do – Slashdot is reporting on some new plagiarism-in-research software. As with many Slashdot postings, it seems under-analysed – “search and destroy” is overly aggressive language regarding plagiarism in my book, but I would say that; I loved that the Situationists plagiarised Isidore Ducasse.
  • TAUS has a fantastic pictorial history of Machine Translation: A Translation Automation Timeline. It’s quite nerdy, but it shows the growth of the industry over time well – starting in 1945 it focuses largely on researchers, while by 2010 most entries revolve around corporate involvement.
  • At some point I was pointed to Moses for Mere Mortals – I’ve talked about the machine translation engine Moses in a previous post, in particular how difficult it is for the non-technical to set up. It contains both Windows and Ubuntu packages that will make testing Moses a lot easier for the non-technical. It’s also handy for getting up to speed on just what is required in the MT toolchain – given that most people are put off by the set-up alone, the last few yards (using corpora for MT training, for instance) are often missed or under-analysed. I’ve not tried it myself yet, but you can consider this post a TODO.
  • I’ve never read her novels, but Zadie Smith writes wonderfully. I recently saw The Social Network, and her analysis was interesting reading: Generation Why – it’s that kind of film. And that led me to a fascinating article she wrote on voice, just after Obama was elected – Speaking in Tongues:

    My own childhood had been the story of this and that combined, of the synthesis of disparate things. It never occurred to me that I was leaving the London district of Willesden for Cambridge. I thought I was adding Cambridge to Willesden, this new way of talking to that old way. Adding a new kind of knowledge to a different kind I already had. And for a while, that’s how it was: at home, during the holidays, I spoke with my old voice, and in the old voice seemed to feel and speak things that I couldn’t express in college, and vice versa. I felt a sort of wonder at the flexibility of the thing. Like being alive twice.

    But flexibility is something that requires work if it is to be maintained. Recently my double voice has deserted me for a single one, reflecting the smaller world into which my work has led me. Willesden was a big, colorful, working-class sea; Cambridge was a smaller, posher pond, and almost univocal; the literary world is a puddle. This voice I picked up along the way is no longer an exotic garment I put on like a college gown whenever I choose—now it is my only voice, whether I want it or not. I regret it; I should have kept both voices alive in my mouth. They were both a part of me. But how the culture warns against it! As George Bernard Shaw delicately put it in his preface to the play Pygmalion, “many thousands of [British] men and women…have sloughed off their native dialects and acquired a new tongue.”

And I think I will leave you with that for tonight. I’ve just been conversing with a wonderful academic from Indonesia who has given me permission to re-print her essay on the mutability of Bahasa Indonesia – look forward to it.

Small thoughts on the computer’s use of language

I’ve not posted in a while for a number of personal reasons, but I thought I’d share (somewhat shallowly, due to ongoing time constraints) some concepts about how a computer uses language, and the lovely synchronicity my work in translation as a computer scientist provides. I apologise for not going into these concepts further, but it’s hard to know where to start or stop, and I still have a mountain of other work on my todo list. Hopefully I will be able to come back to them later to give better examples and explanations.

For the last few months, I’ve been working part time on a new project for Monash University called Windows on Australia, with the admittedly elegant, yet cumbersome, subtitle “Perceptions in and through translation”. My role has been IT consultation – database and web interface construction (note that I said construction, not design).

During this project I’ve had some luxuries not usually afforded independent web development – time and flexibility being the most noticeable. I’ve had the pleasure of presenting something early in the process, more portable than the original suggestion of spreadsheets, to a bunch of technically savvy, bilingual translation students. By using feedback from the students on the interface and data schema design (indirectly, via questions à la “how do I…” or “what should we do when…”), having the time, and using a flexible web framework (Django), we have quickly and easily got it to the point where it is almost ready for ‘launch’. We have now stopped data entry and development – this was a pilot project that is now being further funded and merged into the definitive Australian literature database, the AustLit project.

In the course of the WoA dev process, I was asked to add a ‘genre’ variable to the book objects (actually, it was probably worded more like: “Can you make it so we can add Genres to books please?”). While adding a new string to a database object is a simple process, it seemed obvious to me that what computer scientists call tags would be a more appropriate model – the functionality desired by adding genres to the texts was exactly what tags offered.

After doing some research, I decided on using django-tagging, which may not be the best option, but was easy to implement and would last the distance that the project required. The most interesting part of implementing tags was explaining to the research assistants how to do the data entry in the tags field. I left it simple at the time, but earlier this week I was compelled to talk about terms that aren’t used very often.

Here’s the quote from the overview.txt:

Tag input
———

Tag input from users is treated as follows:

* If the input doesn’t contain any commas or double quotes, it is simply
  treated as a space-delimited list of tag names.

* If the input does contain either of these characters, we parse the
  input like so:

  * Groups of characters which appear between double quotes take
    precedence as multi-word tags (so double quoted tag names may
    contain commas). An unclosed double quote will be ignored.

  * For the remaining input, if there are any unquoted commas in the
    input, the remainder will be treated as comma-delimited. Otherwise,
    it will be treated as space-delimited.

Examples:

    Tag input string          Resulting tags                       Notes
    apple ball cat            ["apple"], ["ball"], ["cat"]         No commas, so space delimited
    apple, ball cat           ["apple"], ["ball cat"]              Comma present, so comma delimited
    "apple, ball" cat dog     ["apple, ball"], ["cat"], ["dog"]    All commas are quoted, so space delimited
    "apple, ball", cat dog    ["apple, ball"], ["cat dog"]         Contains an unquoted comma, so comma delimited
    apple "ball cat" dog      ["apple"], ["ball cat"], ["dog"]     No commas, so space delimited
    "apple" "ball dog         ["apple"], ["ball"], ["dog"]         Unclosed double quote is ignored
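Those rules are easier to grasp in code than in prose. Here is a rough, from-scratch Python sketch of the parsing logic described in the quote – it is not django-tagging’s actual implementation, just my own illustration, though it reproduces the examples above:

```python
import re

def parse_tag_input(text):
    """A from-scratch sketch of the tag-parsing rules quoted above
    (not django-tagging's real code). Returns sorted, de-duplicated tags."""
    # Rule 1: no commas or double quotes -> plain space-delimited list.
    if ',' not in text and '"' not in text:
        return sorted(set(text.split()))

    # Rule 2a: quoted groups take precedence as multi-word tags
    # (they may contain commas).
    quoted = re.findall(r'"([^"]*)"', text)
    remainder = re.sub(r'"[^"]*"', '', text)
    remainder = remainder.replace('"', '')  # an unclosed quote is ignored

    # Rule 2b: the remainder is comma-delimited if it contains any
    # unquoted commas, otherwise space-delimited.
    if ',' in remainder:
        rest = remainder.split(',')
    else:
        rest = remainder.split()

    tags = [t.strip() for t in quoted + rest if t.strip()]
    return sorted(set(tags))
```

Run against the table above, each input yields the listed tags – for example, `parse_tag_input('"apple" "ball dog')` gives `['apple', 'ball', 'dog']`.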

As you can see, there are a lot of if/then cases that can be hard to decipher if you aren’t used to thinking like a computer (or a computer scientist). One of the most fundamental things we are taught in computer science classes is parsing – which is literally “how to read text”. Often using Regular Expressions, or regexes, a parser breaks a text – any text blob: a sentence, paragraph, book, line of code, or folders containing many files of many lines of code – into atomic pieces (in vaguely literary terms: from book to page, from page to paragraph, from paragraph to sentence, from sentence to word) which can then be acted upon. You often hear about how the code we write is turned into the zeros and ones a computer can understand by a compiler: the compiler parses the code and applies a long list of rules based on the grammar of that particular computer language, the end result being the zeros and ones.
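To make the sentence-to-word idea concrete, here is a toy example of my own (not from any real compiler): a one-line regex “parser” that breaks a sentence into its atomic pieces, words and punctuation marks.

```python
import re

def tokenize(sentence):
    # \w+ grabs runs of letters/digits (words); [^\w\s] grabs single
    # punctuation characters; the whitespace between them is discarded.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenize("How to read text, atomically."))
# ['How', 'to', 'read', 'text', ',', 'atomically', '.']
```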

This leads to the idea of delimiters – the commas in a ‘comma-separated values’ (CSV) file, for example – and other characters commonly used by parsers as delimiters, like the pipe character.
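For instance, Python’s standard csv module will happily parse a pipe-delimited record just by swapping the delimiter (the record below is my own made-up example):

```python
import csv
import io

record = "Doctorow|Little Brother|2008"
# Tell the csv parser that the pipe, not the comma, is the delimiter.
reader = csv.reader(io.StringIO(record), delimiter="|")
fields = next(reader)
print(fields)  # ['Doctorow', 'Little Brother', '2008']
```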

As you can see, it brings together a number of ideas that are common across languages – whether they are human or computer based. It has always been fascinating to me that names like Chomsky (see: the Chomsky hierarchy) and Hofstadter (see: Gödel, Escher, Bach or Le Ton beau de Marot) turn up in these theories – people I’d previously read for their mathematical, political, musical or artistic contributions to science – and how all of these subjects are intertwined. If only I had more time…

Endangered languages on YouTube

Without a doubt, one of the greatest gifts that the Internet has given us is access to information that previously was in the public domain but inaccessible. As a young boy (with a scientist for a father) I remember dreaming with jealousy, wistfulness and excitement about visiting institutions like the British Library, London’s Natural History Museum, The Smithsonian or The Louvre – basically anywhere with a sense of grandeur, coupled with learning opportunities that were a combination of history, art, science and spectacle.

There’s no denying that a lot of these institutions held such regard in my young mind because they were also foreign – how or when was I going to get the opportunity or time to go to Europe or the Americas? If I went with my parents, how would I convince them that I could spend a week in each?

While it lacks the frisson of the unattainable, Google’s effort to preserve endangered languages on YouTube does now give access to the other attractor: the information. Songs, spoken word and non-verbal communication methods from Siberia, Papua New Guinea, Africa and South America are featured – including Tuvan throat singing, which had arguably already been immortalised via the tried and true method of associating itself with the best self-documenting movement around – punk.

In a funny way, it reminds me of visiting the doctor’s surgery as a child, but instead of out-dated and dated-looking National Geographic magazines (which remained fascinating regardless, like Grandma’s fifteen-year-old encyclopedia set), we have this Enduring Voices project, we have Fotopedia’s Heritage project for mobile platforms (OK, that’s more a desire than a reality – it’s only available on the Apple platform, #boohiss), Flickr’s Map and Interesting Photos and, of course, Wikipedia.

Imagination food has never been so easily accessible – I wish I were as time-rich as I was twenty years ago.