Small thoughts on the computer’s use of language

I’ve not posted in a while for a number of personal reasons, but I thought I’d share (somewhat shallowly, due to ongoing time constraints) some concepts about how a computer uses language, and the lovely synchronicity that my work in translation as a computer scientist provides. I apologise for not going into these concepts in more detail, but it’s hard to know where to start or stop, and I still have a mountain of other work on my to-do list. Hopefully I will be able to come back to them later with better examples and explanations.

For the last few months, I’ve been working part time on a new project for Monash University called Windows on Australia, with the admittedly elegant, yet cumbersome, subtitle “Perceptions in and through translation”. My role has been IT consultation – database and web interface construction (note that I said construction, not design).

During this project I’ve had some luxuries not usually afforded to independent web development, time and flexibility being the most noticeable. I’ve had the pleasure of being able to present something early in the process, more portable than the original suggestion of spreadsheets, to a bunch of technically savvy, bilingual translation students. Thanks to the students’ feedback on the interface and data schema design (given indirectly, via questions à la “how do I…” or “what should we do when…”), the time available, and a flexible web framework (Django), we got it quickly and easily to the point where it is almost ready for ‘launch’. We have now stopped data entry and development – this was a pilot project that is now being further funded to be merged into the definitive Australian literature database, the AustLit project.

In the course of the WoA dev process, I was asked to add a ‘genre’ variable to the book objects (actually, it was probably worded more like: “Can you make it so we can add Genres to books please?”). While adding a new string to a database object is a simple process, it seemed obvious to me that what computer scientists call tags would be a more appropriate model – the functionality desired by adding genres to the texts was exactly what tags offered.

After doing some research, I decided on using django-tagging, which may not be the best option, but was easy to implement and would last the distance that the project required. The most interesting part of implementing tags was explaining to the research assistants how to do the data entry in the tags field. I left it simple at the time, but earlier this week I was compelled to talk about terms that aren’t used very often.

Here’s the relevant section of django-tagging’s overview.txt:

Tag input

Tag input from users is treated as follows:

* If the input doesn’t contain any commas or double quotes, it is simply
  treated as a space-delimited list of tag names.

* If the input does contain either of these characters, we parse the
  input like so:

  * Groups of characters which appear between double quotes take
    precedence as multi-word tags (so double quoted tag names may
    contain commas). An unclosed double quote will be ignored.

  * For the remaining input, if there are any unquoted commas in the
    input, the remainder will be treated as comma-delimited. Otherwise,
    it will be treated as space-delimited.


Tag input string        Resulting tags               Notes
apple ball cat          [apple], [ball], [cat]       No commas, so space delimited
apple, ball cat         [apple], [ball cat]          Comma present, so comma delimited
"apple, ball" cat dog   [apple, ball], [cat], [dog]  All commas are quoted, so space delimited
"apple, ball", cat dog  [apple, ball], [cat dog]     Contains an unquoted comma, so comma delimited
apple "ball cat" dog    [apple], [ball cat], [dog]   No commas, so space delimited
"apple" "ball dog       [apple], [ball], [dog]       Unclosed double quote is ignored
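
For anyone who finds code easier to read than rules, here is a rough Python sketch of the behaviour described above. It is my own illustration, not django-tagging’s actual implementation: the function name parse_tags is made up for this post, and unlike the library (which, as far as I know, also de-duplicates and sorts the result) it simply returns the tags in the order they were typed.

```python
import re

# Matches a closed pair of double quotes and captures the text between them.
QUOTED = re.compile(r'"([^"]*)"')


def parse_tags(text):
    """Split a tag input string using the rules quoted above (sketch only)."""
    # No commas or double quotes: plain space-delimited input.
    if "," not in text and '"' not in text:
        return text.split()

    # Splitting on a pattern with a capturing group alternates the pieces:
    # even indexes hold unquoted text, odd indexes hold the quoted groups.
    pieces = QUOTED.split(text)

    # An unclosed double quote is ignored, so drop any stray quote characters.
    unquoted = [p.replace('"', "") for p in pieces[0::2]]

    # Any comma left outside the quotes makes the remainder comma-delimited.
    delimiter = "," if any("," in p for p in unquoted) else " "

    tags = []
    for index, piece in enumerate(pieces):
        if index % 2:
            candidates = [piece]  # a quoted group stays whole
        else:
            candidates = piece.replace('"', "").split(delimiter)
        tags.extend(c.strip() for c in candidates if c.strip())
    return tags


if __name__ == "__main__":
    for example in ['apple, ball cat', '"apple, ball" cat dog']:
        print(example, "->", parse_tags(example))
```

Each row of the table can be fed through parse_tags to check that the sketch agrees with the documented result.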

As you can see, there are a lot of if/then cases that can be hard to decipher if you aren’t used to thinking like a computer (or a computer scientist). One of the most fundamental things we are taught in computer science classes is parsing, which is quite literally “how to read text”. A parser, often with the help of regular expressions (regexes), breaks a text of any size (a sentence, a paragraph, a book, a line of code, or folders containing many files of many lines of code) into atomic pieces which can then be acted upon. A loosely literary analogy: from book to page, from page to paragraph, from paragraph to sentence, from sentence to word. You often hear that we write code and a compiler turns it into zeros and ones that a computer can understand. To do so, the compiler parses the code that has been written and applies a long list of rules based on the grammar of that particular programming language, the end result being the zeros and ones.
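
To make the “book to word” idea a little more concrete, here is a small Python sketch that breaks a blob of text into paragraphs, sentences and words using regular expressions. The function name atomise and the three patterns are my own simplifications for illustration; a real parser (or a real sentence splitter, which has to cope with abbreviations and the like) is far more careful.

```python
import re


def atomise(text):
    """Break text into paragraphs, then sentences, then words (naive sketch)."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    result = []
    for paragraph in paragraphs:
        # A sentence is assumed to end at '.', '!' or '?' followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph)
        # A word is assumed to be any run of letters, digits or underscores.
        result.append([re.findall(r"\w+", s) for s in sentences])
    return result


if __name__ == "__main__":
    sample = "Cats purr. Dogs bark!\n\nBirds sing."
    for paragraph in atomise(sample):
        print(paragraph)
```

The result is a nested list: one entry per paragraph, each holding one list of words per sentence, which is about the simplest version of the “atomic pieces” a program can then act upon.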

This leads to the idea of delimiters, the characters that mark where one piece ends and the next begins: the commas in a ‘comma separated values’ (CSV) file, for example, or other characters commonly used by parsers, like the pipe character (|).
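
As a quick illustration of delimiters at work, the sketch below uses Python’s standard csv module to split the same kind of data on commas and on pipes. The helper name split_rows and the sample rows are invented for this example.

```python
import csv
import io


def split_rows(text, delimiter=","):
    """Split delimited text into rows of fields using the csv module."""
    # io.StringIO lets the csv reader treat a plain string like a file.
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


if __name__ == "__main__":
    # Comma-delimited; the quoted field keeps its inner comma.
    print(split_rows('title,genre\nBook A,"fiction, family saga"\n'))
    # The same idea with the pipe character as the delimiter.
    print(split_rows("title|genre\nBook B|poetry\n", delimiter="|"))
```

Note that the quoting rule here is the same trick the tag parser relies on: a delimiter inside double quotes is treated as ordinary text.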

As you can see, something as small as tagging brings together a number of ideas that are common across languages, whether human or computer based. It has always fascinated me that names like Chomsky (see: the Chomsky hierarchy) and Hofstadter (see: Gödel, Escher, Bach or Le Ton beau de Marot) turn up in these theories – people whom I’d previously read for their mathematical, political, musical or artistic contributions – and how all of these subjects are intertwined. If only I had more time…