Back to n-grams

Recently a co-worker of mine sent this email to our internal trivia list:

this is what you get for doing contextual rather than plain word/phrase machine translation…

> By the way, how frickin’ awesome is Google Translate?  If you type “АЭС ‘Бушер’ (Иран),” it generates “Bushehr nuclear power plant (Iran).”  That’s impressive — the algorithm not only knows that the Russian acronym атомная электростанция (АЭС) — literally “atomic electro station” — is rendered “nuclear power plant (NPP)” in English, but also understands that the Russian style of “NPP Bushehr” should be translated into “Bushehr nuclear power plant.”  Try it!

from a press release by the Russian nuclear regulator on the defueling of the Bushehr nuclear reactor in Iran – news

and analysis of that by some wonks I read –

Apart from being interesting in itself, this reminded me of the flurry of articles on n-grams that we had earlier last year (Google and Semantics, NYTimes on Google Translate), which I had to go searching for to share with my colleagues.

While I was searching, I discovered that late last year Google Labs announced a new project called ngrams:

Google Books has scanned over 10% of all books ever published, and now you can graph the occurrence of phrases up to five words in length from 1400 through the present day right in your browser.

It’s available in six languages: Chinese, English, French, German, Hebrew, and Russian. You can run your own tests at Books Ngram viewer, with more information about the data set (corpora) on the info page. Not especially useful in my hands on a hazy Saturday morning, but I’m sure there are scholars who will find it interesting. I can’t imagine what its uses are, and I’m finding it hard to think of useful examples to show, but here are a few that I’ve whipped up. Note that the searches are case sensitive, and that a percentage on the vertical axis doesn’t tell you much unless you go digging for the underlying numbers yourself.
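For anyone wondering what that percentage might actually be counting, here is a minimal sketch in Python. It assumes the figure is the share of all same-length n-grams that match the phrase, which is my reading of the info page rather than Google's actual code. The helper names `ngrams` and `share` and the sample sentence are made up for illustration, and the real viewer works per year over a huge corpus instead of one string.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def share(phrase, text):
    """Percentage of all same-length n-grams in text that equal phrase.

    Matching is case sensitive: no lowercasing is applied anywhere.
    """
    target = tuple(phrase.split())
    grams = ngrams(text.split(), len(target))
    if not grams:
        # Text is shorter than the phrase, so nothing can match.
        return 0.0
    return 100.0 * Counter(grams)[target] / len(grams)


# Hypothetical one-sentence "corpus", purely for illustration.
sample = "New York is big and the New York Times covers New York"

print(share("New York", sample))  # three of the eleven bigrams match
print(share("new york", sample))  # no match, because case matters
```

Even in this toy version the percentage only means something once you know how many n-grams it was divided by, which is the digging I mentioned.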

translating, interpreting: Interesting to see that interpreting made a big jump in usage from the 1890s.

translating, interpreting, translation: Ah! “translation” wins.

I tried quite a few, with little information of interest. My favourite example, New York, New York Times and New York Times Square, is interesting however – remove a term from the front and repeat the search. It quickly becomes obvious that New York, at a supposed 387 years, has been around a lot longer than the New York Times at 160 years, and both have been around longer than New York Times Square at the relatively young age of 107 years.

I’d imagine that this result is also a function of the size of the signifier. New York represents a whole city. The New York Times is a journalistic reflection of that city that expands beyond the city in content and distribution, yet has less cultural significance than New York – certainly no one has written a New York Times State of Mind, New York Times, New York Times or The Only Living Boy in New York Times Square, although that last one could make some good gangsta Hip Hop I guess. And we find that New York Times Square occurs less again – it’s smaller than New York in size, and smaller than the NYT in our imaginations.
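There is also a plainer, mechanical reason for that ordering: every occurrence of a longer phrase contains the shorter phrase it starts with, so its raw count can never be higher. A rough sketch, using a hypothetical `count_phrase` helper and a sentence I made up (this compares raw counts, not the viewer's percentages):

```python
def count_phrase(phrase, text):
    """Count case-sensitive occurrences of phrase as a run of whole words."""
    target = phrase.split()
    tokens = text.split()
    n = len(target)
    return sum(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))


# Made-up text, purely for illustration.
text = ("I left New York , read the New York Times , "
        "and walked past the New York Times Square sign")

for phrase in ("New York", "New York Times", "New York Times Square"):
    print(phrase, count_phrase(phrase, text))
```

The counts can only stay level or fall as the phrase grows, so cultural significance decides how steep the drop is, not whether there is one.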