Google, voice recognition and search

The London Review of Books has an interesting look at Google titled It knows. The article discusses how much Google knows and what it’s doing with that information – instead of offering all of it to you in search, it’s keeping some back:

The reason is that Google is learning. The more data it gathers, the more it knows, the better it gets at what it does. Of course, the better it gets at what it does the more money it makes, and the more money it makes the more data it gathers and the better it gets at what it does – an example of the kind of win-win feedback loop Google specialises in – but what’s surprising is that there is no obvious end to the process.

While I’m less eloquent, and have been less able to pinpoint it so effectively, it is a refrain you hear a lot on this blog. I’ve posted previously about how Google has learnt semantics – this article follows up that anecdote with a fascinating and insightful description of GOOG-411, the voice search service briefly offered by Google, making the now obvious reasons for its existence a lot clearer. Given yesterday’s launch of Siri by Apple, it is a timely reminder that there is more than one player in the voice recognition field, and that Apple wasn’t even first to it:

By 2007, Google knew enough about the structure of queries to be able to release a US-only directory inquiry service called GOOG-411. You dialled 1-800-4664-411 and spoke your question to the robot operator, which parsed it and spoke you back the top eight results, while offering to connect your call. It was free, nifty and widely used, especially because – unprecedentedly for a company that had never spent much on marketing – Google chose to promote it on billboards across California and New York State. People thought it was weird that Google was paying to advertise a product it couldn’t possibly make money from, but by then Google had become known for doing weird and pleasing things. In 2004, it launched Gmail with what was for the time an insanely large quota of free storage – 1GB, five hundred times more than its competitors. But in that case it was making money from the ads that appeared alongside your emails. What was it getting with GOOG-411? It soon became clear that what it was getting were demands for pizza spoken in every accent in the continental United States, along with questions about plumbers in Detroit and countless variations on the pronunciations of ‘Schenectady’, ‘Okefenokee’ and ‘Boca Raton’. GOOG-411, a Google researcher later wrote, was a phoneme-gathering operation, a way of improving voice recognition technology through massive data collection.

Three years later, the service was dropped, but by then Google had launched its Android operating system and had released into the wild an improved search-by-voice service that didn’t require a phone call. You tapped the little microphone icon on your phone’s screen – it was later extended to Blackberries and iPhones – and your speech was transmitted via the mobile internet to Google servers, where it was interpreted using the advanced techniques the GOOG-411 exercise had enabled. The baby had learned to talk.

But success wasn’t immediate. And failure is often the best way to learn – it forces us to adapt.

Before Google bought YouTube in 2006 for $1.65 billion, it had a fledgling video service of its own, predictably called Google Video, that in its initial incarnation offered the – it seemed – brilliant feature of answering a typed phrase with a video clip in which those words were spoken. The promise was that, for example, you’d be able to search for the phrase ‘in my beginning is my end’ and see T.S. Eliot, on film, reciting from the Four Quartets. But no such luck. Google Video’s search worked by a kind of trickery: it used the hidden subtitles that broadcasters provide for the hard of hearing, which Google had generally paid to use, and searched against the text. The service is just one of the many experiments that Google over the years has killed, but a presumably large reason for its death was that although it appeared to work it was really very limited. Not everything is tailored for the deaf, and subtitles are often wrong. If, however, Google is able to deploy its newly capable voice recognition system to transcribe the spoken words in the two days’ worth of video uploaded to YouTube every minute, there would be an explosion in the amount of searchable material. Since there’s no reason Google can’t do it, it will.

The final part of the article bemoans the size of Google:

Google is getting cleverer precisely because it is so big. If it’s cut down to size then what will happen to everything it knows? That’s the conundrum. It’s clearly wrong for all the information in all the world’s books to be in the sole possession of a single company. It’s clearly not ideal that only one company in the world can, with increasing accuracy, translate text between 506 different pairs of languages. On the other hand, if Google doesn’t do these things, who will?

Which is a legitimate concern, no doubt. Who needs a one-world government when society can just be taken over by a large corporation by stealth? Having said that, there’s no reason why we can’t live together in harmony, this society of ours and Google. I just think Google will have to give back in return for what it’s taken from us – make the maps free. Make the translations free. Keep the search free – and even open its heuristics. Am I asking too much? Am I not being cynical enough? My inner anarchist is squeamish at the thought of allowing it to happen, but my inner futurist is excited at its possibilities.

Accents and invasive phrases

As so often happens with this blog, I start with one page and end up on something much more interesting. Yesterday on Twitter someone mentioned this article on the Australian language taking over the world, which I thought would be interesting but which ended up being a light and fluffy piece with a few anecdotes and some admittedly interesting Google Trends data on phrase usage – personally, I feel a little bit responsible for (and proud of) the spread of ‘no worries‘.

But the very last link in that article promised the tantalising:

(By the way, if you want to get ahead of the game you can learn more about how to speak proper English here.)

Follow the link. Go on. And what you find is a treasure trove of “how to speak with an X accent” videos by a group called VideoJug, who seem to be an online TV station using YouTube as their broadcast partner. Other accents taught include Scottish, German, French, Russian, Cockney, British, Irish, South African, New York and American – even how to lose your native accent.

Unicode’s “right-to-left” override can be used to hide malware

Scary news for Unicode – a very interesting attack vector has been discovered for those who want access to your information or computer, using the Unicode character U+202E, otherwise known as right-to-left override or RLO:

this can (and is) also used by malware creeps to disguise the names of the files they attach to their phishing emails. For example, the file “CORP_INVOICE_08.14.2011_Pr.phylexe.doc” is actually “CORP_INVOICE_08.14.2011_Pr.phyldoc.exe” (an executable file!) with a U+202e placed just before “doc.”

This is apparently an old attack, but I’ve never seen it, and it’s a really interesting example of the unintended consequences that arise when small, reasonable changes are introduced into complex systems like type-display technology.

As is pointed out in the comments, Cory has made an error – the example file name he should have used was:

CORP_INVOICE_08.14.2011_Pr.phylcod.exe
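
To make the trick concrete, here’s a minimal Python sketch of my own (an illustration, not code from either article) that builds that corrected filename. The operating system acts on the logical order of the string, which really does end in “.exe”; only the display is fooled:

```python
# A minimal sketch of the RLO filename trick. The string's logical
# order ends in ".exe", which is what the OS acts on; a bidi-aware
# display reverses the overridden run, so the name *appears* to
# end in ".doc".

RLO = "\u202e"  # U+202E RIGHT-TO-LEFT OVERRIDE

filename = "CORP_INVOICE_08.14.2011_Pr.phyl" + RLO + "cod.exe"

print(repr(filename))             # the \u202e escape is visible here
print(filename.endswith(".exe"))  # True - it really is an executable
# A bidi-aware renderer displays it as:
#   CORP_INVOICE_08.14.2011_Pr.phylexe.doc
```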

But really, Cory’s is merely a semantic error. The issue has some very interesting side effects, although we will probably not be able to see them any more – I can imagine they were cleaned up quite quickly:

I copied the program that powers the Windows command prompt (cmd.exe) and successfully renamed it so that it appears as “evilexe.doc” in Windows. When I tried to attach the file to an outgoing Gmail message, Google sent me the usual warning that it doesn’t allow executable files, but the warning message itself was backwards:

“evil ‮”cod.exe is an executable file. For security reasons, Gmail does not allow you to send “this type of file.

The most interesting thing here is something I’ve only just discovered as a result of writing this post: the “backwards” writing I’ve quoted above is actually different from the text in the original article.

The actual Google warning is this:

evildoc.exe is an executable file. For security reasons, Gmail does not allow you to send this type of file.

The original article has this backwards:

“cod.exe is an executable file. For security reasons, Gmail does not allow you to send “this live” type of file.

Cory has this backwards:

“cod.exe is an executable file. For security reasons, Gmail does not live” allow you to send “this type of file.

Somewhere in the process the author went through, another invisible character was added to his text: U+202C, the “pop directional formatting” character, and the wrapping involved in the quoting process has started to mess things up. I wonder whether the addition of that character is dictated by the Unicode standard, or whether it was done by the OS (Windows, one would presume) – and, if done by the OS, whether the functionality was weighed up or just used because it worked, without considering the outcome. Or was it added by Google?
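
Whichever it was, the effect of the pop character is easy to sketch. Here’s my own Python guess at the mechanics (not taken from the article) – the exact rendering, of course, depends on the bidi support of whatever displays the strings:

```python
# Why U+202C matters: an unterminated U+202E keeps reversing
# everything that follows it, while a U+202C "pop" restores the
# original direction for the rest of the line.

RLO = "\u202e"  # right-to-left override
PDF = "\u202c"  # pop directional formatting

unterminated = "evil" + RLO + "cod.exe is an executable file."
terminated   = "evil" + RLO + "cod" + PDF + ".exe is an executable file."

print(repr(unterminated))
print(repr(terminated))
# In a bidi-aware display, the first renders with the whole tail
# reversed (roughly "evil.elif elbatucexe na si exe.doc"), while
# the second renders as "evildoc.exe is an executable file." -
# the pop confines the reversal to "cod".
```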

U+202E

This character is one of the layout controls (pdf) – all of which are invisible operators – that allow bidirectional text. There are seven characters in this group, providing embedding of bidirectionality up to 61 levels deep:

Unicode supports standard bidirectional text without any special characters. In other words Unicode conforming software should display right-to-left characters such as Hebrew letters as right-to-left simply from the properties of those characters. Similarly, Unicode handles the mixture of left-to-right-text alongside right-to-left text without any special characters. For example, one can quote Arabic (“بسملة”) (translated into English as “Bismillah”) right alongside English and the Arabic letters will flow from right-to-left and the Latin letters left-to-right. However, support for bidirectional text becomes more complicated when text flowing in opposite directions is embedded hierarchically, for example if one quotes an Arabic phrase that in turn quotes an English phrase. Other situations may also complicate this, such as when an author wants the left-to-right characters overridden so that they flow from right-to-left. While these situations are fairly rare, Unicode provides seven characters (U+200E, U+200F, U+202A, U+202B, U+202C, U+202D, U+202E) to help control these embedded bidirectional text levels up to 61 levels deep.

From the Mapping of Unicode characters Wikipedia entry, we can see what the function of each of these characters is:

The render-time directional type of a neutral character can remain ambiguous when the mark is placed on the boundary between directional changes. To address this, Unicode includes two characters that have strong directionality, have no glyph associated with them, and are ignorable by systems that do not process bidirectional text:

  • Left-to-right mark (U+200E)
  • Right-to-left mark (U+200F)

Surrounding a bidirectionally neutral character by the left-to-right mark will force the character to behave as a left-to-right character while surrounding it by the right-to-left mark will force it to behave as a right-to-left character. The behavior of these characters is detailed in Unicode’s Bidirectional Algorithm.

While Unicode is designed to handle multiple languages, multiple writing systems and even text that flows either left-to-right or right-to-left with minimal author intervention, there are special circumstances where the mix of bidirectional text can become intricate—requiring more author control. For these circumstances, Unicode includes five other characters to control the complex embedding of left-to-right text within right-to-left text and vice versa:

  • Left-to-right embedding (U+202A)
  • Right-to-left embedding (U+202B)
  • Pop directional formatting (U+202C)
  • Left-to-right override (U+202D)
  • Right-to-left override (U+202E)
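
To see the difference in behaviour between the override pair and the embedding pair, here’s a short Python sketch of my own (not from the Wikipedia entry):

```python
# Override forces direction regardless of the characters involved;
# embedding only sets a new base direction, so strongly LTR Latin
# letters inside it still run left-to-right. U+202C closes either.

RLE = "\u202b"  # right-to-left embedding
PDF = "\u202c"  # pop directional formatting
RLO = "\u202e"  # right-to-left override

# Forced: "abc" is displayed reversed, as "cba".
forced = RLO + "abc" + PDF

# Embedded: "abc" keeps its own (strong LTR) direction and still
# displays as "abc"; only the ordering of runs around it is
# treated as right-to-left.
embedded = RLE + "abc" + PDF

print(repr(forced))    # '\u202eabc\u202c'
print(repr(embedded))  # '\u202babc\u202c'
```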


Wikipedia in Arabic?: ويكيبيديا: الموسوعة الحرة (“Wikipedia: the free encyclopedia”)

@gr33ndata retweeted this tweet about Wikipedia in Arabic last night:

154,000 Arabic Wikipedia entries. 374m Arabs! How can we learn when we don’t share what we know? Yalla, wikiArabic! http://bit.ly/reu0H0

I followed it through and found that there was a transcription available, but not as subtitles. To activate this function, click the transcription button (circled). I wish there was some easy Universal Subtitles plugin that would just transform the transcript to a subtitle!

Transcription button in YouTube

All languages except ours


I only needed to be berated once by an Indonesian friend for using the term ‘bahasa’ interchangeably with ‘Indonesian’ – as in “Do you speak Bahasa?”. Bahasa means ‘language’, and it made me look silly – “Do you speak a language?” – ‘bahasa Indonesia’ is the correct term. Today I discovered, via Every Word in Icelandic, that they have a word for “all the languages that are not Icelandic“:

you will notice that when you first go into the töff bar to meet my people, they will be saying very clever and witty things to each other, and you just will not understand any of them. Maybe you will be confused about this, but do not worry. (You are not “crazy in your brain-house”.)

The reason you do not understand them is that they are speaking “íslenska”, and you just do not know how to do this.

You only speak “útlenska”. It means “a language which is not íslenska.”

Cool.

More tech anger at translation attempts

In a situation that almost mirrors the story I wrote last week about Steam and its crowd-sourced translations, the French authors of a Debian manual have asked the community for money to professionally translate the book, only to have their request shouted down. More than one commenter suggested something along the lines of frostypiss’s comment:

I’m not really understanding why it’s going to take 15,000 euro.

It’s a translation, not a new work. Why not piecemeal it out to like-minded French / English speakers, and then self publish or simply post a torrent of the file (free as in…FREE)?

You know, “community effort”?

By the way, 15,000 euro is (today) about 20,000 $.

I understand this critique – it’s one I use all the time – not just that the community could do it, but that they can and will do it! Luckily, commenter cp.tar makes the opposite case well:

Do you have any idea how hard it can be for one translator to remain consistent throughout the translation?
Do you have the slightest clue how difficult it would be to actually organize a group translation of such a book?
It is a rather large book, it is highly technical and therefore sensitive to the slightest nuance, and since professional translators are very seldom also highly technically competent, the translation will require frequent consultation with the authors.
All in all, donating money towards the translation is actually more efficient than donating an equivalent amount of your time. Because you are likely not a professional translator. Because you likely do not have the required mastery in both French and English. Because even if the work were divided up and group-translated, it would still have to be reviewed and corrected for grammar, style, and consistency. And trust me, it is often easier to simply trash the whole thing and redo it right, from scratch.

Now, community translation projects can and do function. But they are ongoing projects, often with mistranslated and untranslated parts that keep for ages because nobody had touched or noticed them, and they are often fairly bad.
If you’ve got a big language, such as English or German or Spanish or Chinese (i.e., a language with a large number of well-educated speakers), it’s not all that bad. But in the case of small languages, such as my native Croatian, what you get is crap. And I mean a metric fuckton of crap.
I don’t intend to berate anyone’s work, really. But the problem is that we are a small population (a bit over 4 million), with a lousy percentage of highly educated people, of which few can afford to work for free because our economy is dead, buried, and digging deeper. I’m actually doing some corpus analyses for my thesis (that I’ve been writing, on and off, for over two years) that will help such projects immensely, but I have to get round to it. And when I finally do, I still have to beg my translator friends for a bit of their time, which is at a premium.

I think the important points to note are the large, technical nature of the book (450 pages) and its timeliness – computer texts have notoriously short half-lives. This is not an interface job, nor a subtitling job – lots of small segments that can easily be divided up and won’t change much after the translation/localisation is produced. Still, I was interested to see the reaction to two gents being honest about wanting to do it, and wanting to be paid to do it. I remember, from when I spent more time in the activist world, that the person doing sound at the benefit gig would always be paid. The bands are getting exposure. The fans are paying the cash. The cause is getting some kudos, community and a bump in its funding. The person making sure it sounds great is at their job – they aren’t a charity, and it’s hard, specialised work that not everyone can do.