Data Collations

This week, while working on the Windows on Australia project I was trying to solve a problem we have with sorting of our Unicode data, specifically on the Translators and Target Text pages. Wanting an alphabetical listing, I’ve found that, and you can see on those two pages, that there was an issues with some of the representations – I was getting results that were sorted not incorrectly, but strangely: A Á A Á A Á A and キ き キ き キ are an example from each page. As you can see from these examples, the sorting is actually based on the order they were entered into the database, rather than a strict alphabetisation.

As a result, I’ve finally got a greater understanding of how difficult it is to sort pure Unicode data, and what a Collation is in a database (in this case, MySQL).

Storing the data isn’t a problem – MySQL does this as you would expect using what’s known as a Character set – essentially a variable that describes what type of data a database can store. Unicode means that the database will take more space, as each character requires more information to distinguish it from all the other possibilities – if you use ASCII you only need to distinguish 127 characters. If you use Unicode (UTF-8 in this case) you have access to over 109,000 different characters in 93 scripts, but it takes significantly more space to record them. There is some very elegant theorising behind the scenes, that I wont go into in depth, to reduce how much space they do take up, and the recent increases in available memory have pretty much made using unicode over ASCII a moot point in any setting, not just one that would require a diverse character set.

The collation is how the database knows in what order the data is to be sorted. There is a long and difficult description of a generic unicode sorting algorithm, but implementation is more difficult than description.

Windows on Australia uses utf-8 to record the data, and utf8-unicode-ci as it’s collation (the ci stands for “case insensitive”). The difference between collations is best described here but essentially it comes down to how different languages sort their own scripts and the inconsistencies across those languages. Here is the best example, from the generic unicode sorting algorithm paper I linked to above, in its description of the problem:

Language Swedish: z < ö
German: ö < z
Usage German Dictionary: of < öf
German Telephone: öf < of
Customizations Upper-First A < a
Lower-First a < A

As you can see, there are inconsistencies even within the German language usage.

The end result being, because of Windows on Australia’s potentially broad data set – it may contain both Swedish and German at some point, we will have to live with problematic sorting of data in presentation. I can imagine a solution but it’s complex – Django, the framework that WoA is written in, falls back on the database ordering. This would require me to grab the locale of the browser accessing the site, then writing some specialised python wrapped around SQL queries in the site code – not a quick, easy or elegant solution unfortunately.

If anyone has any other tips on what I can do to remedy this, I’d love to hear them.