Google’s Chrome browser has a built-in function that detects the language of a website and offers a translation when the site isn’t in your local language (and Google translates between those languages) – roughly 64 languages iirc.
Known as Compact Language Detection (CLD), it’s been extracted from the open source browser code base by blogger Mike McCandless and ported into a standalone product on Google Code that can now be integrated into any C++ project; it also comes with some simple Python bindings.
It’s also not clear just how many languages it can detect; I see there are 161 “base” languages plus 44 “extended” languages, but then I see many test cases (102 out of 166!) commented out. This was likely done to reduce the size of the ngram tables; possibly Google could provide the full original set of tables for users wanting to spend more RAM in exchange for detecting the long tail.
Excitingly, since it was first posted, Mike has written a couple more posts on this library: one details the addition of some Python constants and a new method, removeWeakMatches, and another compares the accuracy and performance of CLD and two Java-based projects, the Apache Tika project and the language-detection project:
Some quick analysis:
- The language-detection library gets the best accuracy, at 99.22%, followed by CLD, at 98.82%, followed by Tika at 97.12%. Net/net these accuracies are very good, especially considering how short some of the tests are!
- The difficult languages are Danish (confused with Norwegian), Slovene (confused with Croatian) and Dutch (for Tika and language-detection). Tika in particular has trouble with Spanish (confuses it with Galician). These confusions are to be expected: the languages are very similar.
- When language-detection was wrong, Tika was also wrong 37% of the time and CLD was also wrong 23% of the time. These numbers are quite low! It tells us that the errors are somewhat orthogonal, i.e. the libraries tend to get different test cases wrong. For example, it’s not the case that they are all always wrong on the short texts.
This means the libraries are using different overall signals to achieve their classification (for example, perhaps they were trained on different training texts). This is encouraging since it means, in theory, one could build a language detection library combining the signals of all of these libraries and achieve better overall accuracy.
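To make the overlap figures above concrete, here is a minimal Python sketch of how such a conditional error rate could be computed. The function name also_wrong_rate and the toy predictions are made up for illustration; they are not from Mike's test harness or corpus.

```python
def also_wrong_rate(truth, reference_preds, other_preds):
    """Of the cases the reference detector got wrong, return the
    fraction that the other detector also got wrong."""
    ref_wrong = [i for i, (t, p) in enumerate(zip(truth, reference_preds)) if p != t]
    if not ref_wrong:
        return 0.0  # the reference detector made no errors
    both_wrong = sum(1 for i in ref_wrong if other_preds[i] != truth[i])
    return both_wrong / len(ref_wrong)


# Toy data (hypothetical), using the confusions mentioned above.
truth      = ["da", "nl", "sl", "es", "en"]
langdetect = ["no", "nl", "hr", "es", "en"]  # wrong on items 0 and 2
tika       = ["no", "nl", "sl", "gl", "en"]  # wrong on items 0 and 3
cld        = ["da", "nl", "sl", "es", "en"]  # no errors

print(also_wrong_rate(truth, langdetect, tika))
print(also_wrong_rate(truth, langdetect, cld))
```

A low value means the second detector tends to be right where the first one fails, which is what makes combining them worthwhile.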
You could also make a simple majority-rules voting system across these (and other) libraries. I tried exactly that approach: if any language receives 2 or more votes from the three detectors, select that as the detected language; otherwise, go with the language-detection choice. This gives the best accuracy of all: total 99.59% (= 16930 / 17000)!
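The voting rule described above is simple enough to sketch in a few lines of Python. This is only an illustration of the rule as stated, not Mike's actual code; the function name vote and its parameters are hypothetical, and each argument is assumed to be the language code returned by one detector.

```python
from collections import Counter


def vote(cld_guess, tika_guess, langdetect_guess):
    """Majority-rules vote across three detectors.

    If any language gets 2 or more votes, pick it; otherwise fall
    back to the language-detection guess.
    """
    counts = Counter([cld_guess, tika_guess, langdetect_guess])
    lang, n = counts.most_common(1)[0]
    if n >= 2:
        return lang
    return langdetect_guess


print(vote("da", "da", "no"))  # two votes for "da" win
print(vote("da", "no", "sv"))  # no majority, so the third argument is used
```

Falling back to language-detection on a three-way split makes sense here because it had the best standalone accuracy of the three.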
Finally, I also separately tested the run time for each package. Each time is the best of 10 runs through the full corpus:
CLD 171 msec 16.331 MB/sec
language-detection 2367 msec 1.180 MB/sec
Tika 42219 msec 0.066 MB/sec
CLD is incredibly fast! language-detection is an order of magnitude slower, and Tika is another order of magnitude slower (not sure why).
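For anyone wanting to run a similar comparison, below is a rough Python sketch of a "best of N runs" timing loop that reports throughput in MB/sec. It is not the benchmark Mike used: the function best_of, the stand-in detector and the choice of 1 MB = 1,000,000 bytes are all assumptions. As a sanity check on the numbers above, multiplying each time by its throughput gives roughly the same corpus size for all three libraries (about 2.8 MB).

```python
import time


def best_of(detect, texts, runs=10):
    """Time `detect` over all texts `runs` times and keep the best run.

    Returns (best_seconds, mb_per_sec). Assumes 1 MB = 1,000,000 bytes
    of UTF-8 encoded input.
    """
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        for t in texts:
            detect(t)
        best = min(best, time.perf_counter() - start)
    mb = total_bytes / 1_000_000
    rate = mb / best if best > 0 else float("inf")
    return best, rate


# Stand-in detector (hypothetical); swap in a real library call here.
def fake_detect(text):
    return "en"


seconds, rate = best_of(fake_detect, ["hello world"] * 1000)
print(f"{seconds * 1000:.3f} msec  {rate:.3f} MB/sec")
```

Taking the best run rather than the average reduces noise from things like JIT warm-up or other processes on the machine.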