When the Internet, and computers for that matter, were first being developed, no one thought far enough ahead to build in language neutrality: everything was in English. That’s not entirely true (it was all in ASCII), but that is a story for another day. I presume the thinking either just wasn’t there, or it was considered a problem to be addressed when it arose. I find it hard to be overly critical of this approach. While it is culturally insensitive and US-centric, no one had any idea what massive changes were about to be unleashed onto the whole planet as a result of their research.
The situation has come a long way since then – from a thriving Internationalisation and Localisation industry to the subtitling underground, the world is now online.
The one thing missing, of course, was foreign (i.e., non-English-alphabet) characters in URLs and domain names, but
(i)n late 2009, the Internet Corporation for Assigned Names and Numbers (ICANN) approved the creation of internationalized country code top-level domains (IDN ccTLDs) in the Internet that use the IDNA standard for native language scripts.
Essentially, non-ASCII scripts such as Chinese characters and Arabic script were approved for use in top-level domain names. A great day for an international internet, with Egypt, the Russian Federation, Saudi Arabia, and the United Arab Emirates being the first countries to have the opportunity. Of course, there was one remaining problem: the whole software stack between bare metal and the browser was written expecting ASCII characters. Rewriting it all would take an unacceptable amount of time, and the perceived increase in complexity could leave some quite svelte software bloated. Computer scientists ended up doing what they do best: routing around the problem.
This is where Punycode comes in. It is designed to represent Unicode characters using only ASCII, so that lower-level software can handle them. The mathematics behind it is fairly involved, and you are welcome to attempt to understand it in the three main RFCs that address this issue: RFC 3492, “Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)”; RFC 5891, “Internationalized Domain Names in Applications (IDNA): Protocol”; and RFC 5894, “Internationalized Domain Names for Applications (IDNA): Background, Explanation, and Rationale”, which is less useful to non-nerds than it sounds.
Luckily, that’s what I’m here for, although it should be noted that RFCs adhere to templating standards that put almost every other academic journal on the planet to shame. Computer scientists and engineers have this thing about exactness, I guess.
I’ll start with a little background, with examples following.
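For a quick feel of what that mapping looks like before digging into the RFCs, here is a minimal sketch using the `punycode` and `idna` codecs that ship with Python's standard library. The helper name `to_punycode` and the `bücher.example` domain are my own illustrative choices, and as far as I know the built-in `idna` codec follows the older IDNA 2003 rules rather than RFC 5891, so treat this as a demonstration and not a reference implementation.

```python
# Minimal sketch: Unicode label -> ASCII using Python's built-in codecs.

def to_punycode(label: str) -> str:
    """Encode one label with raw Punycode (no ACE prefix). Hypothetical helper."""
    return label.encode("punycode").decode("ascii")

# Raw Punycode: the ASCII characters come first, then a "-" delimiter,
# then digits describing where the non-ASCII characters belong.
print(to_punycode("bücher"))  # bcher-kva

# The "idna" codec works on whole domain names and adds the "xn--" ACE prefix
# to every label that needed encoding; pure ASCII labels are left alone.
ace = "bücher.example".encode("idna").decode("ascii")
print(ace)  # xn--bcher-kva.example

# Decoding reverses the mapping.
print(ace.encode("ascii").decode("idna"))  # bücher.example
```

The `xn--` prefix is the "ACE prefix" that the RFC samples further down leave out.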
Users have expectations about character matching or equivalence that are based on their own languages and the orthography of those languages. These expectations may not always be met in a global system, especially if multiple languages are written using the same script but using different conventions. Some examples:
o A Norwegian user might expect a label with the ae-ligature to be treated as the same label as one using the Swedish spelling with a-diaeresis even though applying that mapping to English would be astonishing to users.
o A German user might expect a label with an o-umlaut and a label that had “oe” substituted, but was otherwise the same, to be treated as equivalent even though that substitution would be a clear error in Swedish.
o A Chinese user might expect automatic matching of Simplified and Traditional Chinese characters, but applying that matching for Korean or Japanese text would create considerable confusion.
o An English user might expect “theater” and “theatre” to match.
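None of these expectations are handled by the encoding itself. As a rough illustration (a sketch using Python's built-in `idna` codec, with a hypothetical `to_ace` helper and labels I made up for the purpose), strings that a human reader might consider equivalent turn into different ASCII labels, so they would be looked up as different names:

```python
def to_ace(label: str) -> str:
    """Lowercase a single label and return its ASCII-compatible form.

    Hypothetical helper for illustration; it relies on Python's built-in
    "idna" codec and does no language-specific mapping at all.
    """
    return label.lower().encode("idna").decode("ascii")

# Pairs a user might expect to "match", loosely following the list above.
pairs = [
    ("bær", "bär"),          # ae-ligature vs. a-diaeresis
    ("köln", "koeln"),       # o-umlaut vs. "oe"
    ("说", "說"),             # Simplified vs. Traditional Chinese
    ("theater", "theatre"),  # spelling variants
]

for first, second in pairs:
    same = to_ace(first) == to_ace(second)
    print(f"{first} -> {to_ace(first)} | {second} -> {to_ace(second)} | equal: {same}")
```

Every pair comes out as not equal. Any matching of this kind has to be arranged elsewhere, for example by a registry's own policies.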
Some examples pulled from RFC 3492:
7.1 Sample strings
In the Punycode encodings below, the ACE prefix is not shown. Backslashes show where line breaks have been inserted in strings too long for one line.
The first several examples are all translations of the sentence “Why can’t they just speak in <language>?” (courtesy of Michael Kaplan’s “provincial” page [PROVINCIAL]). Word breaks and punctuation have been removed, as is often done in domain names.
(A) Arabic (Egyptian):
u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644 u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F
(B) Chinese (simplified):
u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587
(C) Chinese (traditional):
u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587
(D) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074 u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D u+0065 u+0073 u+006B u+0079
(E) Hebrew:
u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8 u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2 u+05D1 u+05E8 u+05D9 u+05EA
(F) Hindi (Devanagari):
u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939 u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947 u+0939 u+0948 u+0902
(G) Japanese (kanji and hiragana):
u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092 u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B
(H) Korean (Hangul syllables):
u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774 u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74 u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C
(I) Russian (Cyrillic):
U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440 u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A u+0438
(J) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070 u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070 u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061 u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070 u+0061 u+00F1 u+006F u+006C
(K) Vietnamese:
U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068 u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067 U+0056 u+0069 u+1EC7 u+0074
The next several examples are all names of Japanese music artists, song titles, and TV programs, just because the author happens to have them handy (but Japanese is useful for providing examples of single-row text, two-row text, ideographic text, and various mixtures thereof).
(L) 3<nen>B<gumi><kinpachi><sensei>
u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F
(M) <amuro><namie>-with-SUPER-MONKEYS
u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074 u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D U+004F U+004E U+004B U+0045 U+0059 U+0053
(N) Hello-Another-Way-<sorezore><no><basho>
U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D u+305D u+308C u+305E u+308C u+306E u+5834 u+6240
(O) <hitotsu><yane><no><shita>2
u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032
(P) Maji<de>Koi<suru>5<byou><mae>
U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059 u+308B u+0035 u+79D2 u+524D
(Q) <pafii>de<runba>
u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
(R) <sono><supiido><de>
u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
The last example is an ASCII string that breaks the existing rules for host name labels. (It is not a realistic example for IDNA, because IDNA never encodes pure ASCII labels.)
(S) -> $1.00 <-
u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020 u+003C u+002D
Punycode: -> $1.00 <--
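To try the samples yourself, the `u+XXXX` notation is easy to turn back into a string. The following is a sketch and not the RFC's reference code: `parse_code_points` is a name I made up, and it relies on Python's built-in `punycode` codec. As far as I can tell, that codec does not implement the RFC's optional mixed-case annotation, so for samples written with an uppercase `U+` the letter case of the output may differ from what the RFC lists.

```python
def parse_code_points(notation: str) -> str:
    """Turn 'u+4ED6 u+4EEC ...' into the string it describes.

    Hypothetical helper: each token is 'u+' or 'U+' followed by hex digits.
    """
    return "".join(chr(int(token[2:], 16)) for token in notation.split())

samples = {
    "B": "u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587",
    "S": "u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 "
         "u+0020 u+003C u+002D",
}

results = {}
for name, notation in samples.items():
    text = parse_code_points(notation)
    encoded = text.encode("punycode").decode("ascii")
    # Sanity check: decoding must give back exactly what we started with.
    assert encoded.encode("ascii").decode("punycode") == text
    results[name] = encoded
    print(f"({name}) {text!r} -> {encoded!r}")

# (B) encodes to 'ihqwcrb4cv8a8dqg056pqjye'
# (S) encodes to '-> $1.00 <--' (the input plus a trailing delimiter)
```

Sample (S) shows why a pure ASCII string still changes: the encoder appends the `-` delimiter even when nothing follows it.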
I hope this has been as enlightening for you as it has been for me; I was unaware of Punycode before today as well.