Localised Malware

Trend Micro is reporting localised malware seen in the wild.

The malware strain known as VOBFUS works by copying itself onto removable media like USB sticks with names like porn.exe or sexy.exe. 

This variant also uses file names written in these languages:

  • Arabic
  • Bosnian
  • Chinese
  • Croatian
  • Czech
  • French
  • German
  • Hungarian
  • Italian
  • Korean
  • Persian
  • Polish
  • Portuguese
  • Romanian
  • Slovak
  • Spanish
  • Thai
  • Turkish
  • Vietnamese

While the languages may differ, they all translate to I love you, Naked, Password, and Webcam.

I’m surprised at times that malware is still a thing, but then I remember that the whole world is online these days – as this development shows.

OmegaT developers offer free hosting for teams

The latest OmegaT news is interesting:

This mail concerns all the teams who work on OmegaT localization.

With all the recent activity on the list, you must be aware that OmegaT 2.6 now offers the ability to easily work in teams over the internet.
The function has been discussed at length here and is also very clearly detailed in blog posts written by 2 very active members of the OmegaT community:


As you can see, the SVN/GIT server setting is the hardest part, including the fact that it is not trivial to find free and professional SVN server hosting services.

So, let me inform you that Didier (in fact Didier Briel Consulting and PnS Concept) is offering all the OmegaT l10n teams (ie languages where 2 or more people work on the localization) professional grade hosting for free with unlimited bandwidth.

The French localization team has been using the service for a few months now and it works like a breeze.

I strongly suggest that all the teams move to such a system because it tremendously eases the translation process when a number of people are involved.


Note that this is only for those translating the OmegaT software itself, but it is an interesting business idea in general – surely there is room in the market for such a service?

PunyCode – internationalising the web

I signed up to the JavaScript Weekly email blast a few weeks ago, and have enjoyed flicking through its range of stories. I was interested this week to see punycode.js was released – a small JavaScript library that does PunyCode conversions. What is PunyCode?

When the Internet, and computers for that matter, were first being developed, no one thought so far ahead as to add language neutrality – everything was in English. That’s not entirely true – it was all in ASCII, but that is a story for another day. I presume the thinking either just wasn’t there, or it was considered a problem to be addressed when it arose. I find it hard to be overly critical of this approach – while it is culturally insensitive and US-centric, no one had any idea what massive changes were about to be unleashed on the whole planet as a result of their research.

The situation has come a long way since then – from a thriving Internationalisation and Localisation industry to the subtitling underground, the world is now online.

The one thing missing, of course, was foreign (i.e., non-English alphabet) characters in URLs or domain names, but

(i)n late 2009, the Internet Corporation for Assigned Names and Numbers (ICANN) approved the creation of internationalized country code top-level domains (IDN ccTLDs) in the Internet that use the IDNA standard for native language scripts.

Essentially, non-ASCII scripts like Chinese characters and Arabic script were approved in URLs. A great day for an international internet – with Egypt, the Russian Federation, Saudi Arabia, and the United Arab Emirates being the first countries to have the opportunity. Of course, there was one remaining problem: the whole software stack between bare metal and the browser was written expecting ASCII characters. Rewriting it all would take an unacceptable amount of time, and the perceived increase in complexity would potentially bloat some quite svelte software. Computer scientists ended up doing what they do best – routing around the problem.

This is where PunyCode comes in – it’s designed to map the Unicode character set into ASCII for lower level software to understand. It uses some fairly involved arithmetic to do this, which you are welcome to attempt to understand in the three main RFCs that address this issue: RFC 3492, Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA); RFC 5891, Internationalized Domain Names in Applications (IDNA): Protocol; and, less useful to non-nerds than it sounds, RFC 5894, Internationalized Domain Names for Applications (IDNA): Background, Explanation, and Rationale.

Luckily, that’s what I’m here for – although it should be noted that RFCs adhere to templating standards that put almost every other academic journal on the planet to shame – computer scientists and engineers have this thing about exactness I guess.
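For a feel of what the mapping looks like in practice, here is a minimal sketch using the "punycode" and "idna" codecs that ship with Python. The label "bücher" and the domain "bücher.example" are my own illustrative choices, not taken from the RFCs, and Python's built-in "idna" codec follows the older IDNA 2003 rules, so treat this as a demonstration rather than a reference implementation.

```python
# Illustrative label only; any non-ASCII string would do.
label = "bücher"

# Raw Punycode: the ASCII letters come first, then "-", then the encoded rest.
encoded = label.encode("punycode")
print(encoded)                            # b'bcher-kva'

# Decoding gets the original label back.
print(encoded.decode("punycode") == label)  # True

# The "idna" codec works on whole domain names and adds the "xn--" ACE prefix
# to every label that needed encoding.
domain = "bücher.example".encode("idna")
print(domain)                             # b'xn--bcher-kva.example'
```

The "xn--" prefix is what tells lower level software that a label is Punycode rather than a plain ASCII name that happens to contain hyphens.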

I’ll start with a little of the background, with examples following.

Users have expectations about character matching or equivalence that are based on their own languages and the orthography of those languages. These expectations may not always be met in a global system, especially if multiple languages are written using the same script but using different conventions. Some examples:
o A Norwegian user might expect a label with the ae-ligature to be treated as the same label as one using the Swedish spelling with a-diaeresis even though applying that mapping to English would be astonishing to users.

o A German user might expect a label with an o-umlaut and a label that had “oe” substituted, but was otherwise the same, to be treated as equivalent even though that substitution would be a clear error in Swedish.

o A Chinese user might expect automatic matching of Simplified and Traditional Chinese characters, but applying that matching for Korean or Japanese text would create considerable confusion.

o An English user might expect “theater” and “theatre” to match.

Some examples pulled from RFC3492:

7.1 Sample strings

In the Punycode encodings below, the ACE prefix is not shown. Backslashes show where line breaks have been inserted in strings too long for one line.

The first several examples are all translations of the sentence “Why can’t they just speak in <language>?” (courtesy of Michael Kaplan’s “provincial” page [PROVINCIAL]). Word breaks and punctuation have been removed, as is often done in domain names.
(A) Arabic (Egyptian):
u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644 u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F
Punycode: egbpdaj6bu4bxfgehfvwxn

(B) Chinese (simplified):
u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587
Punycode: ihqwcrb4cv8a8dqg056pqjye

(C) Chinese (traditional):
u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587
Punycode: ihqwctvzc91f659drss3x8bo0yb

(D) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074 u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D u+0065 u+0073 u+006B u+0079
Punycode: Proprostnemluvesky-uyb24dma41a

(E) Hebrew:
u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8 u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2 u+05D1 u+05E8 u+05D9 u+05EA
Punycode: 4dbcagdahymbxekheh6e0a7fei0b

(F) Hindi (Devanagari):
u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939 u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947 u+0939 u+0948 u+0902
Punycode: i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd

(G) Japanese (kanji and hiragana):
u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092 u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B
Punycode: n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa

(H) Korean (Hangul syllables):
u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774 u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74 u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C
Punycode: 989aomsvi5e83db1d2a355cv1e0vak1dwrv93d5xbh15a0dt30a5j\

(I) Russian (Cyrillic):
U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440 u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A u+0438
Punycode: b1abfaaepdrnnbgefbaDotcwatmq2g4l

(J) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070 u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070 u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061 u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070 u+0061 u+00F1 u+006F u+006C
Punycode: PorqunopuedensimplementehablarenEspaol-fmd56a

(K) Vietnamese:
U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068 u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067 U+0056 u+0069 u+1EC7 u+0074
Punycode: TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g

The next several examples are all names of Japanese music artists, song titles, and TV programs, just because the author happens to have them handy (but Japanese is useful for providing examples of single-row text, two-row text, ideographic text, and various mixtures thereof).

(L) 3<nen>B<gumi><kinpachi><sensei>
u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F
Punycode: 3B-ww4c5e180e575a65lsy2b

(M) <amuro><namie>-with-SUPER-MONKEYS
u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074 u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D U+004F U+004E U+004B U+0045 U+0059 U+0053
Punycode: -with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n

(N) Hello-Another-Way-<sorezore><no><basho>
U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D u+305D u+308C u+305E u+308C u+306E u+5834 u+6240
Punycode: Hello-Another-Way--fc4qua05auwb3674vfr0b

(O) <hitotsu><yane><no><shita>2
u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032
Punycode: 2-u9tlzr9756bt3uc0v

(P) Maji<de>Koi<suru>5<byou><mae>
U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059 u+308B u+0035 u+79D2 u+524D
Punycode: MajiKoi5-783gue6qz075azm5e

(Q) <pafii>de<runba>
u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
Punycode: de-jg4avhby1noc0d

(R) <sono><supiido><de>
u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
Punycode: d9juau41awczczp

The last example is an ASCII string that breaks the existing rules for host name labels. (It is not a realistic example for IDNA, because IDNA never encodes pure ASCII labels.)

(S) -> $1.00 <-
u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020 u+003C u+002D
Punycode: -> $1.00 <--
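The sample strings above are easy to check for yourself. The sketch below is my own helper, not something from the RFC: `from_code_points` is a hypothetical function that turns the "u+XXXX" notation used in the listings into a Python string, which is then fed through Python's built-in "punycode" codec. I have only wired up samples (B) and (R) here.

```python
def from_code_points(listing):
    """Turn 'u+4ED6 u+4EEC ...' (the notation used in the samples) into a str.

    Both 'u+' and 'U+' prefixes are accepted; the prefix is simply skipped.
    """
    return "".join(chr(int(token[2:], 16)) for token in listing.split())


# Sample (B), Chinese (simplified)
sample_b = from_code_points(
    "u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587"
)
encoded_b = sample_b.encode("punycode").decode("ascii")
print(encoded_b)  # ihqwcrb4cv8a8dqg056pqjye

# Sample (R), going the other way: Punycode back to text
sample_r = from_code_points("u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067")
decoded_r = b"d9juau41awczczp".decode("punycode")
print(decoded_r == sample_r)  # True
```

Python's codec only emits lower case in the encoded part, so samples where the RFC shows mixed case after the hyphen may differ in case when compared directly.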

I hope this has been as enlightening for you as it has for me – I was unaware of PunyCode before today as well.

Chrome Language Detection

Google’s Chrome browser has a built-in function for detecting the language of a website and offering a translation if the site isn’t in your local language (and Google translates between those languages) – roughly 64 languages, if I recall correctly.

Known as Compact Language Detection (CLD), it’s been extracted from the open source browser code base by blogger Mike McCandless and ported into a standalone project on Google Code that can now be integrated into any C++ project, along with some simple Python bindings.

It’s also not clear just how many languages it can detect; I see there are 161 “base” languages plus 44 “extended” languages, but then I see many test cases (102 out of 166!) commented out.  This was likely done to reduce the size of the ngram tables; possibly Google could provide the full original set of tables for users wanting to spend more RAM in exchange for detecting the long tail.

Excitingly, since it was first posted, Mike has written a couple more posts on this library – one details the addition of some Python constants and a new method, removeWeakMatches, and another compares the accuracy and performance of CLD and two Java-based projects, the Apache Tika project and the language-detection project:

Some quick analysis:

  • The language-detection library gets the best accuracy, at 99.22%, followed by CLD, at 98.82%, followed by Tika at 97.12%. Net/net these accuracies are very good, especially considering how short some of the tests are!
  • The difficult languages are Danish (confused with Norwegian), Slovene (confused with Croatian) and Dutch (for Tika and language-detection). Tika in particular has trouble with Spanish (confuses it with Galician). These confusions are to be expected: the languages are very similar.

When language-detection was wrong, Tika was also wrong 37% of the time and CLD was also wrong 23% of the time. These numbers are quite low! It tells us that the errors are somewhat orthogonal, i.e. the libraries tend to get different test cases wrong. For example, it’s not the case that they are all always wrong on the short texts.

This means the libraries are using different overall signals to achieve their classification (for example, perhaps they were trained on different training texts). This is encouraging since it means, in theory, one could build a language detection library combining the signals of all of these libraries and achieve better overall accuracy.

You could also make a simple majority-rules voting system across these (and other) libraries. I tried exactly that approach: if any language receives 2 or more votes from the three detectors, select that as the detected language; otherwise, go with language-detection choice. This gives the best accuracy of all: total 99.59% (= 16930 / 17000)!

Finally, I also separately tested the run time for each package. Each time is the best of 10 runs through the full corpus:

Library             Time (msec)  Throughput (MB/sec)
CLD                         171               16.331
language-detection         2367                1.180
Tika                      42219                0.066

CLD is incredibly fast! language-detection is an order of magnitude slower, and Tika is another order of magnitude slower (not sure why).
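The majority-rules voting described in the quote is simple enough to sketch in a few lines of Python. This is my own reading of the rule as stated (two or more votes wins, otherwise fall back to language-detection), not Mike's actual code, and the function name and the language codes passed in are purely illustrative.

```python
from collections import Counter


def majority_vote(cld_guess, tika_guess, langdetect_guess):
    """Pick the language at least two detectors agree on.

    If all three disagree, fall back to the language-detection guess,
    as described in the quoted post.
    """
    votes = Counter([cld_guess, tika_guess, langdetect_guess])
    language, count = votes.most_common(1)[0]
    if count >= 2:
        return language
    return langdetect_guess


# Two detectors agree, so their answer wins.
print(majority_vote("no", "no", "da"))  # no

# No agreement at all: use the language-detection guess.
print(majority_vote("es", "gl", "pt"))  # pt

# The accuracy figure quoted for this approach.
print(round(16930 / 17000 * 100, 2))    # 99.59
```

With only three voters there can never be two different languages with two votes each, so the first entry from `most_common` is unambiguous whenever a majority exists.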

Lictionary: shipping is everything

Yesterday I wrote about Lictionary, “a localization dictionary that presents information repository which is constituted by free softwares” and noted that one of its strengths was that it had shipped. What does that mean? In software and technology, it’s generally understood that “release early, release often” gives distinct advantages – failure comes earlier and easier, feedback loops with users are smaller and quicker, the software project will look “alive” to developers and users, and no one uses software or sites that never made it out the door. There are plenty of articles about the advantages – here’s one by Matt Mullenweg, a founding developer of the software this site runs, WordPress.

Anyway – after the write up I gave them yesterday I popped over to their site to let them know what I thought and to point out the most glaring errors – the CDATA and the incorrect language attribution. Pleasingly, there was a response in my inbox almost immediately from Türker at TSDesigns. The language attribution error has been fixed already (at least for Indonesian; I’ve not done any further testing), and they certainly are shipping – from what they have said, they only started collecting data last week:

We have started to collect our data last week. We are choosing and indexing a lot of repositories first time. After this first phase completed, we will identify problematic issues and eliminate these problems. CDATA problem is a sample of these situations. We are discussing about this. We can parse CDATA or skip them.

Manually choosing best translation is so hard. There are too many entries in system and there are several “best” translation for some strings in different contexts. So we added voting system. Translations are sorted by vote count. And we hide translations which have many negative votes. We will show best translations at the top with support of our users in the future. And also there is a trick in voting system. We add a positive vote to translations for each file. So mostly used translations have a head start.
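To make the voting scheme a bit more concrete, here is a rough sketch of how such a ranking might work. Everything in it is my guess at the shape of the thing, not Lictionary's code: the `Translation` class, the field names, and especially the `MAX_NEGATIVE` cut-off for "many negative votes" are hypothetical, and the strings are placeholders.

```python
from dataclasses import dataclass

MAX_NEGATIVE = 5  # hypothetical cut-off; the real threshold is not stated


@dataclass
class Translation:
    text: str
    files: int      # number of source files this translation appears in
    up: int = 0     # positive votes from users
    down: int = 0   # negative votes from users

    @property
    def score(self):
        # One automatic positive vote per file gives widely used
        # translations the "head start" mentioned in the reply.
        return self.files + self.up - self.down


def rank(translations):
    """Hide heavily down-voted entries, then sort the rest by score."""
    visible = [t for t in translations if t.down < MAX_NEGATIVE]
    return sorted(visible, key=lambda t: t.score, reverse=True)


candidates = [
    Translation("translation A", files=1, up=2),
    Translation("translation B", files=5),
    Translation("translation C", files=1, down=6),
]
ranked = [t.text for t in rank(candidates)]
print(ranked)  # ['translation B', 'translation A']
```

Here "translation B" wins on file count alone, while "translation C" is hidden outright because of its negative votes.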

Nothing makes me more excited than responsive developers. Can’t wait to see where this goes.

Lictionary: a localisation catalogue

I was thinking about different ways to aggregate all the GPL’d localisation data available online just last week when an email landed in my inbox via the Django localisation email list announcing that Lictionary was now live.

For today, Lictionary.in contains ~160.000 unique strings and ~2.4 million translations in dozens of different languages and grows day by day.

The front page has the now ubiquitous search box, in which you enter a string, choose the language you wish it translated to in the drop down box next to the search bar, and hit go. I started with the simple “Enter” into Indonesian, and soon noted a couple of errors – large chunks of CDATA and a line noting that “201 result(s) found for “Enter” in Bengali“.

Is it perfect? No. Has it shipped? Yes – and that’s the most important thing, presuming they can keep it up. I’d be very interested to see if it would be possible to integrate with Tatoeba since they are delivering a similar product. Lictionary has the advantage of the thousands of translation files available with FLOSS software, but Tatoeba has a nicer interface.

Another concern is that Lictionary depends upon the correctness of the underlying files – any mistakes now need to go through Lictionary, then onto the software project from which they came. The FAQ briefly touches on this, but not enough to fill me with confidence just yet:

– Some translations seems wrong, what can i do?

You can give negative vote for this translation. We inform translators or translation teams periodically about negative voted translations. If you want to inform translator immediately, you can contact the translator or translation team directly or may be file a bug in bufg tracker of related project.

Also, the next most useful thing would certainly be to submit a selection of strings and have the best translations returned – localising software one string at a time would be tiresome for the monolingual software engineer. This is also addressed in the FAQ:

– Is there any other way except webpage to use Lictionary?

Unfortunately, no. You can only use our website to search in our database. We are developing web service interfaces for developers. Soon, we will publish technical details and documentation about these.

I look forward to seeing how this project develops and will be sure to report in as it improves.

Unicode’s “right-to-left” override can be used to hide malware

Scary news for Unicode – a very interesting attack vector has been discovered for those who want access to your information or computer, using the Unicode character U+202E, otherwise known as Right-to-left override or RLO:

this can (and is) also used by malware creeps to disguise the names of the files they attach to their phishing emails. For example, the file “CORP_INVOICE_08.14.2011_Pr.phylexe.doc” is actually “CORP_INVOICE_08.14.2011_Pr.phyldoc.exe” (an executable file!) with a U+202e placed just before “doc.”

This is apparently an old attack, but I’ve never seen it, and it’s a really interesting example of the unintended consequences that arise when small, reasonable changes are introduced into complex systems like type-display technology.

As is pointed out in the comments, Cory has made an error – the example file name he should have used was


But really, that’s merely a semantic error. The issue has some very interesting side effects, although we probably won’t be able to see them any more, as I imagine they were cleaned up quite quickly:

I copied the program that powers the Windows command prompt (cmd.exe) and successfully renamed it so that it appears as “evilexe.doc” in Windows. When I tried to attach the file to an outgoing Gmail message, Google sent me the usual warning that it doesn’t allow executable files, but the warning message itself was backwards:

“evil ‮”cod.exe is an executable file. For security reasons, Gmail does not allow you to send “this type of file.
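The renaming trick is easy to reproduce in a few lines of Python. This is only a sketch: `naive_display` is my own crude stand-in for what a bidi-aware renderer does (it just reverses everything after the override character), whereas the real Unicode Bidirectional Algorithm is far more involved.

```python
import os

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE

# What is actually stored on disk: "evil", the invisible override, "cod.exe"
stored_name = "evil" + RLO + "cod.exe"

# The operating system decides what to do with the file from the stored
# characters, so the real extension is still .exe
real_extension = os.path.splitext(stored_name)[1]
print(real_extension)  # .exe


def naive_display(name):
    """Crude approximation of rendering: reverse the text after the RLO."""
    head, _, tail = name.partition(RLO)
    return head + tail[::-1]


shown = naive_display(stored_name)
print(shown)  # evilexe.doc
```

The user sees something ending in ".doc", while the file system still sees an executable.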

The most interesting thing here is something I’ve only just discovered as a result of writing this post: the “backwards” writing I’ve mentioned above is actually different from the text in the original article.

The actual Google warning is this:

evildoc.exe is an executable file. For security reasons, Gmail does not allow you to send this type of file.

Original article has this backwards:

“cod.exe is an executable file. For security reasons, Gmail does not allow you to send “this live” type of file.

Cory has this backwards:

“cod.exe is an executable file. For security reasons, Gmail does not live” allow you to send “this type of file.

Somewhere in the process that the author went through, another invisible character was added to his text: the U+202C, or “pop directional formatting” character, and the wrapping that is involved in the quoting process has started to mess things up. I wonder if the addition of that is dictated by the Unicode standard or whether it was done by the OS (Windows, one would presume) and if done by the OS whether the functionality was weighed up or if it was just used because it worked without considering the outcome? Or was it added by Google?


This character is one of the layout controls (pdf) – all of which are invisible operators – that allow bi-directional text. There are seven characters in this group – to provide embedding of bi-directionality up to 61 levels deep:

Unicode supports standard bidirectional text without any special characters. In other words Unicode conforming software should display right-to-left characters such as Hebrew letters as right-to-left simply from the properties of those characters. Similarly, Unicode handles the mixture of left-to-right-text alongside right-to-left text without any special characters. For example, one can quote Arabic (“بسملة”) (translated into English as “Bismillah”) right alongside English and the Arabic letters will flow from right-to-left and the Latin letters left-to-right. However, support for bidirectional text becomes more complicated when text flowing in opposite directions is embedded hierarchically, for example if one quotes an Arabic phrase that in turn quotes an English phrase. Other situations may also complicate this, such as when an author wants the left-to-right characters overridden so that they flow from right-to-left. While these situations are fairly rare, Unicode provides seven characters (U+200E, U+200F, U+202A, U+202B, U+202C, U+202D, U+202E) to help control these embedded bidirectional text levels up to 61 levels deep.

From the Mapping of Unicode characters wikipedia entry, we can see what the function of each of these characters is:

The render-time directional type of a neutral character can remain ambiguous when the mark is placed on the boundary between directional changes. To address this, Unicode includes two characters that have strong directionality, have no glyph associated with them, and are ignorable by systems that do not process bidirectional text:

  • Left-to-right mark (U+200E)
  • Right-to-left mark (U+200F)

Surrounding a bidirectionally neutral character by the left-to-right mark will force the character to behave as a left-to-right character while surrounding it by the right-to-left mark will force it to behave as a right-to-left character. The behavior of these characters is detailed in Unicode’s Bidirectional Algorithm.

While Unicode is designed to handle multiple languages, multiple writing systems and even text that flows either left-to-right or right-to-left with minimal author intervention, there are special circumstances where the mix of bidirectional text can become intricate—requiring more author control. For these circumstances, Unicode includes five other characters to control the complex embedding of left-to-right text within right-to-left text and vice versa:

  • Left-to-right embedding (U+202A)
  • Right-to-left embedding (U+202B)
  • Pop directional formatting (U+202C)
  • Left-to-right override (U+202D)
  • Right-to-left override (U+202E)
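Since all seven of these characters are invisible, the practical defence is to look for them explicitly. Below is a minimal sketch of a checker built on Python's standard `unicodedata` module. The function names are my own, and the set only covers the seven code points listed above; later versions of Unicode add further directional controls that a real filter would also want to consider.

```python
import unicodedata

# The seven bidirectional control characters listed above
BIDI_CONTROLS = {
    "\u200e", "\u200f",                      # marks
    "\u202a", "\u202b", "\u202c",            # embeddings and pop
    "\u202d", "\u202e",                      # overrides
}


def find_bidi_controls(text):
    """Return (position, code point, Unicode name) for each control found."""
    return [
        (index, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for index, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]


def strip_bidi_controls(text):
    """Remove the controls so the name displays in its stored order."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


suspicious = "evil\u202ecod.exe"
found = find_bidi_controls(suspicious)
print(found)                            # [(4, 'U+202E', 'RIGHT-TO-LEFT OVERRIDE')]

cleaned = strip_bidi_controls(suspicious)
print(cleaned)                          # evilcod.exe
```

Stripping is a blunt instrument, since these characters have legitimate uses in genuinely bidirectional text; for attachment file names, though, flagging them seems a reasonable default.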


Wikipedia in Arabic?: ويكيبيديا: الموسوعة الحرة

@gr33ndata retweeted this tweet about Wikipedia in Arabic last night:

154,000 Arabic Wikipedia entries. 374m Arabs! How can we learn when we don’t share what we know? Yalla, wikiArabic! http://bit.ly/reu0H0

I followed it through and found that there was a transcription available, but not as subtitles. To activate this function, click the transcription button (circled). I wish there were some easy Universal Subtitles plugin that would just transform the transcript into subtitles!

Transcription button in YouTube
