PunyCode – internationalising the web

I signed up to the JavaScript Weekly email blast a few weeks ago, and have enjoyed flicking through its range of stories. I was interested this week to see that punycode.js had been released – a small JavaScript library that does PunyCode conversions. What is PunyCode?

When the Internet, and computers for that matter, were first being developed, no one thought so far ahead as to add language neutrality – everything was in English. That’s not entirely true – it was all in ASCII, but that is a story for another day. I presume the thinking either just wasn’t there, or it was considered a problem to be addressed when it arose. I find it hard to be overly critical of this approach – while it is culturally insensitive and US-centric, no one had any idea what massive changes were about to be unleashed on the whole planet as a result of their research.

The situation has come a long way since then – from a thriving Internationalisation and Localisation industry to the subtitling underground, the world is now online.

The one thing missing, of course, was foreign (i.e. non-English alphabet) characters in URLs and domain names, but

(i)n late 2009, the Internet Corporation for Assigned Names and Numbers (ICANN) approved the creation of internationalized country code top-level domains (IDN ccTLDs) in the Internet that use the IDNA standard for native language scripts.

Essentially, non-ASCII scripts like Chinese characters and Arabic script were approved in URLs. A great day for an international internet – with Egypt, the Russian Federation, Saudi Arabia, and the United Arab Emirates being the first countries to have the opportunity. Of course, there was one remaining problem – the whole software stack between bare metal and the browser was written expecting ASCII characters; rewriting it all would take an unacceptable amount of time, and the perceived increase in complexity would potentially make some quite svelte software bloated. Computer scientists ended up doing what they do best – routing around the problem.

This is where PunyCode comes in – it’s designed to map the Unicode character set into ASCII so that lower-level software can understand it. The mathematics behind the mapping is involved, and you are welcome to attempt to understand it in the three main RFCs that address the issue: 3492: Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications, 5891: Internationalized Domain Names in Applications (IDNA): Protocol, and the less useful to non-nerds than it sounds 5894: Internationalized Domain Names for Applications (IDNA): Background, Explanation, and Rationale.

Luckily, that’s what I’m here for – although it should be noted that RFCs adhere to templating standards that put almost every other academic journal on the planet to shame – computer scientists and engineers have this thing about exactness, I guess.

I’ll start with a little of the background, with examples to follow.

Users have expectations about character matching or equivalence that are based on their own languages and the orthography of those languages. These expectations may not always be met in a global system, especially if multiple languages are written using the same script but using different conventions. Some examples:
o A Norwegian user might expect a label with the ae-ligature to be treated as the same label as one using the Swedish spelling with a-diaeresis even though applying that mapping to English would be astonishing to users.

o A German user might expect a label with an o-umlaut and a label that had “oe” substituted, but was otherwise the same, to be treated as equivalent even though that substitution would be a clear error in Swedish.

o A Chinese user might expect automatic matching of Simplified and Traditional Chinese characters, but applying that matching for Korean or Japanese text would create considerable confusion.

o An English user might expect “theater” and “theatre” to match.

Some examples pulled from RFC3492:

7.1 Sample strings

In the Punycode encodings below, the ACE prefix is not shown. Backslashes show where line breaks have been inserted in strings too long for one line.

The first several examples are all translations of the sentence “Why can’t they just speak in <language>?” (courtesy of Michael Kaplan’s ”provincial” page [PROVINCIAL]). Word breaks and punctuation have been removed, as is often done in domain names.
(A) Arabic (Egyptian):
u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644 u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F
Punycode: egbpdaj6bu4bxfgehfvwxn

(B) Chinese (simplified):
u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587
Punycode: ihqwcrb4cv8a8dqg056pqjye

(C) Chinese (traditional):
u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587
Punycode: ihqwctvzc91f659drss3x8bo0yb

(D) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074 u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D u+0065 u+0073 u+006B u+0079
Punycode: Proprostnemluvesky-uyb24dma41a

(E) Hebrew:
u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8 u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2 u+05D1 u+05E8 u+05D9 u+05EA
Punycode: 4dbcagdahymbxekheh6e0a7fei0b

(F) Hindi (Devanagari):
u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939 u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947 u+0939 u+0948 u+0902
Punycode: i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd

(G) Japanese (kanji and hiragana):
u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092 u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B
Punycode: n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa

(H) Korean (Hangul syllables):
u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774 u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74 u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C
Punycode: 989aomsvi5e83db1d2a355cv1e0vak1dwrv93d5xbh15a0dt30a5j\
psd879ccm6fea98c

(I) Russian (Cyrillic):
U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440 u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A u+0438
Punycode: b1abfaaepdrnnbgefbaDotcwatmq2g4l

(J) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070 u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070 u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061 u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070 u+0061 u+00F1 u+006F u+006C
Punycode: PorqunopuedensimplementehablarenEspaolfmd56a

(K) Vietnamese:
T<adotbelow>isaoh<odotbelow>kh<ocirc>ngth<ecirchookabove>ch\
<ihookabove>n<oacute>iti<ecircacute>ngVi<ecircdotbelow>t
U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068 u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067 U+0056 u+0069 u+1EC7 u+0074
Punycode: TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g

The next several examples are all names of Japanese music artists, song titles, and TV programs, just because the author happens to have them handy (but Japanese is useful for providing examples of single-row text, two-row text, ideographic text, and various mixtures thereof).

(L) 3<nen>B<gumi><kinpachi><sensei>
u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F
Punycode: 3B-ww4c5e180e575a65lsy2b

(M) <amuro><namie>-with-SUPER-MONKEYS
u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074 u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D U+004F U+004E U+004B U+0045 U+0059 U+0053
Punycode: -with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n

(N) Hello-Another-Way-<sorezore><no><basho>
U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D u+305D u+308C u+305E u+308C u+306E u+5834 u+6240
Punycode: Hello-Another-Way--fc4qua05auwb3674vfr0b

(O) <hitotsu><yane><no><shita>2
u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032
Punycode: 2-u9tlzr9756bt3uc0v

(P) Maji<de>Koi<suru>5<byou><mae>
U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059 u+308B u+0035 u+79D2 u+524D
Punycode: MajiKoi5-783gue6qz075azm5e

(Q) <pafii>de<runba>
u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
Punycode: de-jg4avhby1noc0d

(R) <sono><supiido><de>
u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
Punycode: d9juau41awczczp

The last example is an ASCII string that breaks the existing rules for host name labels. (It is not a realistic example for IDNA, because IDNA never encodes pure ASCII labels.)

(S) -> $1.00 <-
u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020 u+003C u+002D
Punycode: -> $1.00 <--
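You don’t need the JavaScript library to check these samples – Python ships with a punycode codec (and an idna codec for the “xn--” ACE form used in actual domain labels). A minimal sketch, using sample (B) above:

```python
# Verify sample (B) above using Python's built-in codecs.
label = "他们为什么不说中文"   # u+4ED6 ... u+6587, "Why don't they speak Chinese?"

print(label.encode("punycode"))   # b'ihqwcrb4cv8a8dqg056pqjye' - matches the RFC sample
print(label.encode("idna"))       # the domain-label form, with the "xn--" ACE prefix added

# And the round trip back to the original characters:
print(b"ihqwcrb4cv8a8dqg056pqjye".decode("punycode"))
```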

I hope this has been as enlightening for you as it has for me – I was unaware of PunyCode before today as well.

Chrome Language Detection

Google’s Chrome browser has a built-in function for detecting the language of a website and offering a translation if that language isn’t your local language (and Google translates between the two) – roughly 64 languages, iirc.

Known as Compact Language Detection (CLD), it’s been extracted from the open-source browser’s code base by blogger Mike McCandless and turned into a stand-alone project on Google Code that can now be integrated into any C++ project, along with some simple Python bindings.

It’s also not clear just how many languages it can detect; I see there are 161 “base” languages plus 44 “extended” languages, but then I see many test cases (102 out of 166!) commented out.  This was likely done to reduce the size of the ngram tables; possibly Google could provide the full original set of tables for users wanting to spend more RAM in exchange for detecting the long tail.

Excitingly, since that first post, Mike has written a couple more pieces on this library – one details the addition of some Python constants and a new method, removeWeakMatches, and another compares the accuracy and performance of CLD against two Java-based projects: the Apache Tika project and the language-detection project:

Some quick analysis:

  • The language-detection library gets the best accuracy, at 99.22%, followed by CLD, at 98.82%, followed by Tika at 97.12%. Net/net these accuracies are very good, especially considering how short some of the tests are!
  • The difficult languages are Danish (confused with Norwegian), Slovene (confused with Croatian) and Dutch (for Tika and language-detection). Tika in particular has trouble with Spanish (confuses it with Galician). These confusions are to be expected: the languages are very similar.

When language-detection was wrong, Tika was also wrong 37% of the time and CLD was also wrong 23% of the time. These numbers are quite low! It tells us that the errors are somewhat orthogonal, i.e. the libraries tend to get different test cases wrong. For example, it’s not the case that they are all always wrong on the short texts.

This means the libraries are using different overall signals to achieve their classification (for example, perhaps they were trained on different training texts). This is encouraging since it means, in theory, one could build a language detection library combining the signals of all of these libraries and achieve better overall accuracy.

You could also make a simple majority-rules voting system across these (and other) libraries. I tried exactly that approach: if any language receives 2 or more votes from the three detectors, select that as the detected language; otherwise, go with language-detection choice. This gives the best accuracy of all: total 99.59% (= 16930 / 17000)!

Finally, I also separately tested the run time for each package. Each time is the best of 10 runs through the full corpus:

CLD  171 msec  16.331 MB/sec
language-detection  2367 msec  1.180 MB/sec
Tika  42219 msec  0.066 MB/sec

CLD is incredibly fast! language-detection is an order of magnitude slower, and Tika is another order of magnitude slower (not sure why).
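The majority-rules vote Mike describes is simple enough to sketch. The three detector functions below are hypothetical stand-ins – in practice they would wrap CLD, Tika and language-detection respectively; none of the names or signatures here come from the actual libraries:

```python
from collections import Counter

def detect_cld(text):          # placeholder for a CLD wrapper
    return "en"

def detect_tika(text):         # placeholder for a Tika wrapper
    return "en"

def detect_langdetect(text):   # placeholder for a language-detection wrapper
    return "en"

def vote(text):
    guesses = [detect_cld(text), detect_tika(text), detect_langdetect(text)]
    lang, count = Counter(guesses).most_common(1)[0]
    if count >= 2:
        return lang            # two or more detectors agree: take the majority
    return guesses[2]          # no majority: fall back to language-detection
```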

Lictionary: shipping is everything

Yesterday I wrote about Lictionary, “a localization dictionary that presents information repository which is constituted by free softwares”, and noted that one of its strengths was that it had shipped. What does that mean? In software and technology, it’s generally understood that “release early, release often” gives distinct advantages – failure comes earlier and easier, feedback loops with users are smaller and quicker, the project looks “alive” to developers and users, and no one uses software or sites that never made it out the door. There are plenty of articles about the advantages – here’s one by Matt Mullenweg, a founding developer of WordPress, the software this site runs on.

Anyway – after the write-up I gave them yesterday, I popped over to their site to let them know what I thought, and about the most glaring errors – the CDATA and the incorrect language attribution. Pleasingly, there was a response in my inbox almost immediately from Türker at TSDesigns. The language attribution error has already been fixed (at least for Indonesian; I’ve not done any further testing), and they certainly are shipping – from what they have said, they only started collecting data last week:

We have started to collect our data last week. We are choosing and indexing a lot of repositories first time. After this first phase completed, we will identify problematic issues and eliminate these problems. CDATA problem is a sample of these situations. We are discussing about this. We can parse CDATA or skip them.

Manually choosing best translation is so hard. There are too many entries in system and there are several “best” translation for some strings in different contexts. So we added voting system. Translations are sorted by vote count. And we hide translations which have many negative votes. We will show best translations at the top with support of our users in the future. And also there is a trick in voting system. We add a positive vote to translations for each file. So mostly used translations have a head start.

Nothing makes me more excited than responsive developers. Can’t wait to see where this goes.

Unicode’s “right-to-left” override can be used to hide malware

Scary news for Unicode – a very interesting attack vector has been discovered for those who want access to your information or computer, using the Unicode character U+202E, otherwise known as the right-to-left override or RLO:

this can (and is) also used by malware creeps to disguise the names of the files they attach to their phishing emails. For example, the file “CORP_INVOICE_08.14.2011_Pr.phylexe.doc” is actually “CORP_INVOICE_08.14.2011_Pr.phyldoc.exe” (an executable file!) with a U+202e placed just before “doc.”

This is apparently an old attack, but I’ve never seen it, and it’s a really interesting example of the unintended consequences that arise when small, reasonable changes are introduced into complex systems like type-display technology.

As is pointed out in the comments, Cory has made an error – the example file name he should have used was

CORP_INVOICE_08.14.2011_Pr.phylcod.exe
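To see why that is the right “wrong” filename, here’s a minimal sketch – the string genuinely ends in “.exe”, and the U+202E character only changes how the trailing “cod.exe” is displayed (reversed, so it reads “exe.doc”):

```python
RLO = "\u202E"   # RIGHT-TO-LEFT OVERRIDE

real_name = "CORP_INVOICE_08.14.2011_Pr.phyl" + RLO + "cod.exe"

print(real_name.endswith(".exe"))   # True - the OS still sees an executable
print(real_name)                    # a bidi-aware display renders "...Pr.phylexe.doc"
```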

But really, Cory’s mix-up is merely a semantic error. The issue has some very interesting side effects, although we will probably not be able to see them any more – I imagine they were cleaned up quite quickly:

I copied the program that powers the Windows command prompt (cmd.exe) and successfully renamed it so that it appears as “evilexe.doc” in Windows. When I tried to attach the file to an outgoing Gmail message, Google sent me the usual warning that it doesn’t allow executable files, but the warning message itself was backwards:

“evil ‮”cod.exe is an executable file. For security reasons, Gmail does not allow you to send “this type of file.

The most interesting thing here is something I’ve only just discovered as a result of writing this post: the “backwards” writing I’ve quoted above is actually different from the text in the original article.

The actual Google warning is this:

evildoc.exe is an executable file. For security reasons, Gmail does not allow you to send this type of file.

Original article has this backwards:

“cod.exe is an executable file. For security reasons, Gmail does not allow you to send “this live” type of file.

Cory has this backwards:

“cod.exe is an executable file. For security reasons, Gmail does not live” allow you to send “this type of file.

Somewhere in the process the author went through, another invisible character was added to his text – U+202C, or “pop directional formatting” – and the wrapping involved in the quoting process has started to mess things up. I wonder whether the addition of that character is dictated by the Unicode standard, or whether it was done by the OS (Windows, one would presume) – and, if done by the OS, whether the behaviour was weighed up or just used because it worked, without considering the outcome. Or was it added by Google?

U+202E

This character is one of the layout controls (pdf) – all of which are invisible operators – that allow bi-directional text. There are seven characters in this group, providing embedding of bi-directionality up to 61 levels deep:

Unicode supports standard bidirectional text without any special characters. In other words Unicode conforming software should display right-to-left characters such as Hebrew letters as right-to-left simply from the properties of those characters. Similarly, Unicode handles the mixture of left-to-right-text alongside right-to-left text without any special characters. For example, one can quote Arabic (“بسملة”) (translated into English as “Bismillah”) right alongside English and the Arabic letters will flow from right-to-left and the Latin letters left-to-right. However, support for bidirectional text becomes more complicated when text flowing in opposite directions is embedded hierarchically, for example if one quotes an Arabic phrase that in turn quotes an English phrase. Other situations may also complicate this, such as when an author wants the left-to-right characters overridden so that they flow from right-to-left. While these situations are fairly rare, Unicode provides seven characters (U+200E, U+200F, U+202A, U+202B, U+202C, U+202D, U+202E) to help control these embedded bidirectional text levels up to 61 levels deep.

From Wikipedia’s Mapping of Unicode characters entry, we can see what the function of each of these characters is:

The render-time directional type of a neutral character can remain ambiguous when the mark is placed on the boundary between directional changes. To address this, Unicode includes two characters that have strong directionality, have no glyph associated with them, and are ignorable by systems that do not process bidirectional text:

  • Left-to-right mark (U+200E)
  • Right-to-left mark (U+200F)

Surrounding a bidirectionally neutral character by the left-to-right mark will force the character to behave as a left-to-right character while surrounding it by the right-to-left mark will force it to behave as a right-to-left character. The behavior of these characters is detailed in Unicode’s Bidirectional Algorithm.

While Unicode is designed to handle multiple languages, multiple writing systems and even text that flows either left-to-right or right-to-left with minimal author intervention, there are special circumstances where the mix of bidirectional text can become intricate—requiring more author control. For these circumstances, Unicode includes five other characters to control the complex embedding of left-to-right text within right-to-left text and vice versa:

  • Left-to-right embedding (U+202A)
  • Right-to-left embedding (U+202B)
  • Pop directional formatting (U+202C)
  • Left-to-right override (U+202D)
  • Right-to-left override (U+202E)
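For reference, the names and bidirectional classes of all seven control characters can be pulled straight out of Python’s unicodedata module – a quick sketch:

```python
import unicodedata

# The seven bidirectional control characters listed above.
for ch in "\u200E\u200F\u202A\u202B\u202C\u202D\u202E":
    print(f"U+{ord(ch):04X}  {unicodedata.bidirectional(ch):<3}  {unicodedata.name(ch)}")
```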


Data Collations

This week, while working on the Windows on Australia project, I was trying to solve a problem we have with the sorting of our Unicode data, specifically on the Translators and Target Text pages. Wanting an alphabetical listing, I found – and you can see this on those two pages – that there was an issue with some of the representations: I was getting results that were sorted not incorrectly, but strangely. A Á A Á A Á A and キ き キ き キ are an example from each page. As you can see, the sorting is actually based on the order the entries were added to the database, rather than a strict alphabetisation.

As a result, I’ve finally got a greater understanding of how difficult it is to sort pure Unicode data, and what a Collation is in a database (in this case, MySQL).

Storing the data isn’t a problem – MySQL does this as you would expect using what’s known as a character set, essentially a setting that describes what type of data a column can store. Unicode means that the database will take more space, as each character requires more information to distinguish it from all the other possibilities – ASCII only needs to distinguish 128 characters, whereas Unicode (UTF-8 in this case) gives you access to over 109,000 different characters in 93 scripts, but takes significantly more space to record them. There is some very elegant theorising behind the scenes, which I won’t go into in depth, to reduce how much space they do take up, and the recent increases in available memory have pretty much made choosing Unicode over ASCII a moot point in any setting, not just one that requires a diverse character set.
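The space difference is easy to see by looking at how many bytes UTF-8 uses per character – a quick sketch:

```python
# ASCII characters stay at one byte in UTF-8; most other scripts take two to four.
for ch in ("A", "Á", "ж", "キ", "中"):
    print(ch, len(ch.encode("utf-8")), "byte(s)")
# A 1, Á 2, ж 2, キ 3, 中 3
```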

The collation is how the database knows in what order the data is to be sorted. There is a long and difficult description of a generic Unicode sorting algorithm, but implementation is more difficult than description.

Windows on Australia uses UTF-8 to record the data, and utf8_unicode_ci as its collation (the ci stands for “case insensitive”). The difference between collations is best described here, but essentially it comes down to how different languages sort their own scripts, and the inconsistencies across those languages. The best example comes from the generic Unicode sorting algorithm paper I linked to above, in its description of the problem:

  • Language – Swedish: z < ö; German: ö < z
  • Usage – German dictionary: of < öf; German telephone: öf < of
  • Customizations – Upper-first: A < a; Lower-first: a < A

As you can see, there are inconsistencies even within the German language usage.

The end result is that, because of Windows on Australia’s potentially broad data set – it may contain both Swedish and German at some point – we will have to live with problematic sorting in the presentation. I can imagine a solution, but it’s complex: Django, the framework that WoA is written in, falls back on the database ordering, so I would need to grab the locale of the browser accessing the site and then write some specialised Python wrapped around SQL queries in the site code – not a quick, easy or elegant solution, unfortunately.
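For what it’s worth, the locale-dependence itself is easy to demonstrate outside the database. A minimal Python sketch, assuming the Swedish and German locales are installed on the system (the exact ordering depends on the system’s locale data):

```python
import locale

words = ["of", "öf", "zebra"]

# Swedish: ö sorts after z
locale.setlocale(locale.LC_COLLATE, "sv_SE.UTF-8")
print(sorted(words, key=locale.strxfrm))   # typically ['of', 'zebra', 'öf']

# German (dictionary order): ö sorts with o, before z
locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
print(sorted(words, key=locale.strxfrm))   # typically ['of', 'öf', 'zebra']
```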

If anyone has any other tips on what I can do to remedy this, I’d love to hear them.

Teaching Translation Technology

Next week I am giving a lecture to a Masters-level course on Translation and Technology. I gave the same lecture last year – designed to be a general overview of how these two fields intersect, and there’s not a lot left over after 100 minutes and questions.

When I was trying to organise myself last year, I did a mind map to put everything in perspective. Once it was completed, I decided to jam from the resulting image rather than follow any notes, providing each student with a paper copy on which to take their own notes and posting a digital copy on their subject webspace.

A year on, the mind map stands up well, although I’ve made quite a few adjustments over the year, and to be honest I’d prefer to be working in something like Prezi rather than Xmind to get this done. Prezi looks great – the drill-down function it provides is sorely missing from Xmind – but it really takes good design fu to stop motion sickness (e.g.: bad, best). I’m not sure what the Hungry Beast crew used for this presentation on Google – Flash, I presume – but I don’t have the requisite skills to be that good.

Finally, with neither Xmind nor Prezi being FLOSS (free crippleware for home users la la), and Dia being, well, dire (TODO: post on the software ecosystems that FLOSS still struggles in; OCR, mindmap/presentation/UML, ???), I’ve decided for the moment to stick with Xmind.

The image itself is a construct, and has all the resultant flaws that any construct does – it reflects the author’s interests and perspective. I’m interested in what people think I’ve missed, or how it could be improved. As with any mind map, it can be hard to group like with like, and my relationship links are deliberately sparse – it’s very easy to descend into a spider’s web if you are not careful. I like Xmind’s intelligent grouping function, but I would love to have the ability to place things manually – Machine Translation closer to TEnTs, for instance.

Technology and Translation – a mind map

Click through for the full image; if enough people clamour for it, I’ll work out how to post the .xmind file and add it to the post.

Why i18n and L10n?

Internationalisation and Localisation are often referred to as i18n and L10n – the first and last letter of each word, with a count of the letters in between. Once you know why, it seems so obvious. Wikipedia claims the abbreviations were originally developed at Digital Equipment Corporation in the 70s or 80s, and that:

The capital L in L10n helps to distinguish it from the lowercase i in i18n.

I would also suggest that it has a lot to do with one of the most popular fonts available to programmers and computer users at the time: Courier (Wikipedia), designed by Howard Kettler (bio). As you can see from this picture, the 1 (one) and the lower-case l (L) are easily mistaken for one another: i18n v I18n; L10n v l10n
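The abbreviations themselves are just numeronyms – first letter, a count of the letters in between, last letter – which takes one line to generate:

```python
def numeronym(word):
    # first letter + number of letters in between + last letter
    return word[0] + str(len(word) - 2) + word[-1]

print(numeronym("internationalisation"))   # i18n
print(numeronym("Localisation"))           # L10n
```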

i18n and Bash scripting

One of the first things a programmer learns, and one of a sysadmin’s main tools and working environments, is the Bourne Again Shell, or bash.

Bash is a low-level programming environment, similar to DOS on the Windows OS. Bash comes with Mac OS X as well (Utilities -> Terminal) but, like DOS, is rarely used by most people. Within bash it is possible to do very complicated tasks very quickly by stringing together a collection of smaller tools for finding, sorting, slicing and searching to automate the task at hand. The FLOSS Manuals have an excellent introduction to the command line (which is usually, but not always, bash), and author Neal Stephenson wrote an excellent piece called In the Beginning was the Command Line (it’s getting long in the tooth, but is well worth the effort).

Louis Iacona at the Linux Journal gives a great summary of how to internationalise our bash scripts. As he notes, most other languages have well-developed documentation regarding i18n/L10n – but not bash. He offers no reason why, although I would suggest it’s due to bash’s age, and to the fact that most people use it for small repetitive tasks rather than complex software engineering.

The article begins with a rudimentary intro to i18n/L10n but the meat is further down, where he steps through the processes and commands that are used to i18n a bash script.
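I won’t reproduce his bash examples here, but for flavour, the same gettext-style workflow looks like this in Python – a minimal sketch, assuming a hypothetical “greeter” domain with compiled .mo catalogues under ./locale/<lang>/LC_MESSAGES/:

```python
import gettext

# Load the German catalogue if it exists; fall back to the untranslated strings otherwise.
t = gettext.translation("greeter", localedir="locale", languages=["de"], fallback=True)
_ = t.gettext

# Marked-up strings are looked up in the catalogue at runtime.
print(_("Hello, world"))
```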

While it’s not really ground-breaking, after 23 years bash deserves an i18n howto. A great contribution.