Regular Expressions

One of the things I like most about the OmegaT mailing list is its range of topics, from the simplest “how do I ...?” questions right through to some quite complex computer science concepts.

Recently someone posted a link to a blog post about how regular expressions (aka regexes, regexps, REs) were the most valuable tool in their translation toolbox. REs aren’t a particularly difficult concept to grasp: essentially, they’re how we search for a string of characters in a larger body of text. The harder part is learning how the little bits work together; the hardest part is learning about concepts like greedy vs non-greedy matching and how to avoid the pitfalls that come with them.
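To make the greedy vs non-greedy distinction concrete, here is a minimal sketch in Python (my choice of language for illustration; the sample text and variable names are made up). A greedy quantifier such as `.+` grabs as much as it can, while the non-greedy form `.+?` stops at the first opportunity.

```python
import re

# Hypothetical sample text with some simple inline tags.
text = "<b>bold</b> and <i>italic</i>"

# Greedy: ".+" runs to the LAST ">" it can reach, so we get one big match.
greedy = re.findall(r"<.+>", text)

# Non-greedy: ".+?" stops at the FIRST ">" it can reach, so we get each tag.
lazy = re.findall(r"<.+?>", text)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```

If a search is matching far more text than expected, an unintended greedy quantifier is a likely suspect.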

Regular expressions are a “search” function on steroids, created to find patterns in strings. They can find simple patterns like the word “pattern” in this text, or more complex patterns like “a string that starts with ‘pa’, followed by a letter that’s repeated twice, followed by any three characters that are not ‘space’ or ‘@’ or ‘^’ and followed by a space”.
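As a rough sketch, the complex pattern described above could be written like this in Python. I am assuming that “a letter that’s repeated twice” means the same letter appearing twice in a row (as in the “tt” of “pattern”); the variable names and sample strings are only for illustration.

```python
import re

# pa          literal "pa"
# ([A-Za-z])  any one letter, remembered as group 1
# \1          the same letter again (so it appears twice in a row)
# [^ @^]{3}   exactly three characters that are not space, "@" or "^"
# " "         a literal space
complex_pattern = re.compile(r"pa([A-Za-z])\1[^ @^]{3} ")

match = complex_pattern.search("find the pattern in this text")
print(repr(match.group(0)))  # 'pattern '

# No match: "@" is one of the excluded characters.
print(complex_pattern.search("pass@wo rd"))  # None

# The simple case is just the literal word.
simple = re.search(r"pattern", "find the pattern in this text")
print(simple.group(0))  # pattern
```

Note that the first `^` inside the square brackets means “not”, while the second one is just the literal character.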

A Tao of Regular Expressions is considered a seminal work, and its beauty is in its brevity: a few simple rules, some examples ranging from the obvious to the quite difficult that show the power of REs, and finally an explanation of some of the tools that use REs.

Of course, translators use REs all the time without necessarily knowing it: the segmentation of texts, and later concordance, are all based on REs.
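To give a feel for what RE-based segmentation looks like, here is a deliberately simplified sketch in Python. It is not how OmegaT or any other tool actually implements its rules; the function name `segment` and the single rule (break after “.”, “!” or “?” when followed by whitespace) are my own assumptions for illustration.

```python
import re

def segment(text):
    """Split text into rough sentence segments (illustrative rule only)."""
    # (?<=[.!?]) looks behind for sentence-ending punctuation without
    # consuming it; \s+ is the whitespace we actually split on.
    return re.split(r"(?<=[.!?])\s+", text.strip())

segments = segment("Hello world. How are you? Fine!")
print(segments)  # ['Hello world.', 'How are you?', 'Fine!']
```

A rule this naive will also break after abbreviations such as “Dr.”, which is why real segmentation rule sets include exceptions as well as break rules.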

As an aside, the Mac for Translators blog is fantastic – written by an actual translator, it is vaguely in the same vein as this one, but often tends more toward the technical and the translation, with less pop culture. Give it a read.