Unicode’s “right-to-left” override can be used to hide malware

Scary news for Unicode: a very interesting attack vector has been discovered for those who want access to your information or computer. It uses the Unicode character U+202E, otherwise known as right-to-left override or RLO:

this can (and is) also used by malware creeps to disguise the names of the files they attach to their phishing emails. For example, the file “CORP_INVOICE_08.14.2011_Pr.phylexe.doc” is actually “CORP_INVOICE_08.14.2011_Pr.phyldoc.exe” (an executable file!) with a U+202e placed just before “doc.”

This is apparently an old attack, but I’ve never seen it, and it’s a really interesting example of the unintended consequences that arise when small, reasonable changes are introduced into complex systems like type-display technology.

As is pointed out in the comments, Cory has made an error: the example file name he should have used was

CORP_INVOICE_08.14.2011_Pr.phylcod.exe
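To check the corrected name, here is a minimal Python sketch. The helper naive_display is hypothetical and deliberately simplistic: it assumes that everything between a U+202E and the next U+202C (or the end of the string) is drawn reversed, and it ignores nesting and the rest of the bidirectional algorithm. It is only meant to show that the name a user sees and the extension the operating system acts on can differ.

```python
import os.path

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def naive_display(name: str) -> str:
    """Roughly approximate how a bidi-aware UI would draw `name`.

    Assumption: text after an RLO, up to the next PDF or the end of
    the string, is simply drawn in reverse. Real renderers implement
    the full Unicode Bidirectional Algorithm, which this does not.
    """
    out = []
    rest = name
    while RLO in rest:
        before, _, after = rest.partition(RLO)
        overridden, _, rest = after.partition(PDF)
        out.append(before)            # drawn left-to-right as usual
        out.append(overridden[::-1])  # drawn right-to-left
    out.append(rest)
    return "".join(out)


# The corrected example: the RLO sits just before "cod.exe".
actual = "CORP_INVOICE_08.14.2011_Pr.phyl" + RLO + "cod.exe"

shown = naive_display(actual)           # what the user is likely to see
real_ext = os.path.splitext(actual)[1]  # what the stored name ends with

print(shown)
print(real_ext)
```

Under these assumptions the displayed name ends in “exe.doc” while the stored name still ends in “.exe”, which is the whole trick.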

But really, that’s merely a semantic error. The issue has some very interesting side effects, although we will probably not be able to see them anymore; I imagine they were cleaned up quite quickly:

I copied the program that powers the Windows command prompt (cmd.exe) and successfully renamed it so that it appears as “evilexe.doc” in Windows. When I tried to attach the file to an outgoing Gmail message, Google sent me the usual warning that it doesn’t allow executable files, but the warning message itself was backwards:

“evil ‮”cod.exe is an executable file. For security reasons, Gmail does not allow you to send “this type of file.

The most interesting thing here is something I’ve only just discovered as a result of writing this post. Note that the “backwards” writing quoted above is actually different from the text in the original article.

The actual Google warning is this:

evildoc.exe is an executable file. For security reasons, Gmail does not allow you to send this type of file.

The original article shows it backwards like this:

“cod.exe is an executable file. For security reasons, Gmail does not allow you to send “this live” type of file.

Cory’s post shows it backwards like this:

“cod.exe is an executable file. For security reasons, Gmail does not live” allow you to send “this type of file.

Somewhere in the process that the author went through, another invisible character was added to his text: U+202C, the “pop directional formatting” character, and the line wrapping involved in quoting has started to mess things up. I wonder whether that addition is dictated by the Unicode standard or whether it was done by the OS (Windows, one would presume). If it was done by the OS, was the behaviour weighed up, or was it just used because it worked, without considering the outcome? Or was it added by Google?

U+202E

This character is one of the layout controls, all of which are invisible operators, that allow bidirectional text. There are seven characters in this group, which provide embedding of bidirectionality up to 61 levels deep:

Unicode supports standard bidirectional text without any special characters. In other words Unicode conforming software should display right-to-left characters such as Hebrew letters as right-to-left simply from the properties of those characters. Similarly, Unicode handles the mixture of left-to-right-text alongside right-to-left text without any special characters. For example, one can quote Arabic (“بسملة”) (translated into English as “Bismillah”) right alongside English and the Arabic letters will flow from right-to-left and the Latin letters left-to-right. However, support for bidirectional text becomes more complicated when text flowing in opposite directions is embedded hierarchically, for example if one quotes an Arabic phrase that in turn quotes an English phrase. Other situations may also complicate this, such as when an author wants the left-to-right characters overridden so that they flow from right-to-left. While these situations are fairly rare, Unicode provides seven characters (U+200E, U+200F, U+202A, U+202B, U+202C, U+202D, U+202E) to help control these embedded bidirectional text levels up to 61 levels deep.

From the Mapping of Unicode characters Wikipedia entry, we can see what the function of each of these characters is:

The render-time directional type of a neutral character can remain ambiguous when the mark is placed on the boundary between directional changes. To address this, Unicode includes two characters that have strong directionality, have no glyph associated with them, and are ignorable by systems that do not process bidirectional text:

  • Left-to-right mark (U+200E)
  • Right-to-left mark (U+200F)

Surrounding a bidirectionally neutral character by the left-to-right mark will force the character to behave as a left-to-right character while surrounding it by the right-to-left mark will force it to behave as a right-to-left character. The behavior of these characters is detailed in Unicode’s Bidirectional Algorithm.
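The character properties mentioned above can be inspected directly. As a small sketch, Python’s standard unicodedata module exposes each character’s bidirectional class; the helper name bidi_class and the choice of sample characters are mine, for illustration only.

```python
import unicodedata


def bidi_class(ch: str) -> str:
    """Return the Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)


samples = {
    "Latin letter A": "A",
    "Hebrew letter alef": "\u05d0",
    "Arabic letter beh": "\u0628",
    "exclamation mark": "!",
    "left-to-right mark": "\u200e",
    "right-to-left mark": "\u200f",
    "right-to-left override": "\u202e",
}

for label, ch in samples.items():
    print(f"U+{ord(ch):04X}  {bidi_class(ch):<4} {label}")
```

The letters report a strong class (L, R or AL), the exclamation mark reports a neutral class (ON), and the two invisible marks report the same strong classes as ordinary letters, which is why placing them next to a neutral character can settle its direction.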

While Unicode is designed to handle multiple languages, multiple writing systems and even text that flows either left-to-right or right-to-left with minimal author intervention, there are special circumstances where the mix of bidirectional text can become intricate—requiring more author control. For these circumstances, Unicode includes five other characters to control the complex embedding of left-to-right text within right-to-left text and vice versa:

  • Left-to-right embedding (U+202A)
  • Right-to-left embedding (U+202B)
  • Pop directional formatting (U+202C)
  • Left-to-right override (U+202D)
  • Right-to-left override (U+202E)
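Since all seven characters are invisible, one practical response is to look for them explicitly before trusting a file name. The following is a minimal sketch, not a complete defence: the function names are hypothetical, and it checks only the seven code points listed above (newer versions of the standard define further directional controls that are not covered here).

```python
import unicodedata

# The seven bidirectional control characters listed above.
BIDI_CONTROLS = "\u200e\u200f\u202a\u202b\u202c\u202d\u202e"


def find_bidi_controls(text: str) -> list:
    """Return (index, code point, name) for each control found in `text`."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]


def strip_bidi_controls(text: str) -> str:
    """Return `text` with the seven control characters removed."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


suspicious = "evil\u202ecod.exe"
for index, code_point, name in find_bidi_controls(suspicious):
    print(index, code_point, name)
print(strip_bidi_controls(suspicious))
```

Flagging is probably safer than silently stripping, because file names in right-to-left scripts can contain some of these characters for legitimate reasons.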