Regex… and "economy of accuracy" (Regular Expressions – Part 2)

In Regular Expressions – Part 1 I wrote a summary of where regular expressions can be used in SDL Trados Studio, and I covered a couple of examples.  I also referred to RegexBuddy quite a lot, as it is a really useful tool for helping you write and understand regular expressions.  But in case learning another application is something you don’t want to do, I thought it would be handy to go through what I think are the most useful applications of regular expressions for everyday use in SDL Trados Studio, share a few tips on how to use Studio to verify that your expressions are finding what you need, and introduce a little “economy of accuracy”.
So first of all, an important thing to note, and one that takes away much of the mystery of regex, is what is actually happening.  It is basically a search tool, so if I wanted to find the word “multifarious” using regex amongst these sentences:

I don’t have to do anything complicated, I simply put “multifarious” into the display filter like this:

This is because all regex does is search systematically through the content of each segment looking for a match.  So in this case it looks for the letter m, then when it finds it, it looks for the letter u, and so on until it has matched the whole word.  If it fails at any point then the display filter does not display the segment.
Now, the interesting thing about regex is that it also comes with a number of special patterns you can use to make your search more powerful.  Let’s start with something called an anchor.  The two anchors you will use most are the ^ (caret) symbol and the $ (dollar) symbol.  These symbols don’t match any characters at all, they just tell Studio where to anchor the position of the search.  So for example, if I put a caret at the start I see this, where segment #43 is no longer visible:

This is because the caret says, start searching from the start only.  Segment #43 starts with an I so the search fails immediately.  If I put a dollar at the end instead I see this:

This time segment #41 is not visible, but #43 is.  This is because now Studio is only looking for the case where the word multifarious is the last thing in the segment.  If I wanted to see only segments that contained multifarious and nothing else I could use the caret and the dollar at the same time, now I see this:
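The anchor behaviour described above can also be tried outside Studio. What follows is only a minimal sketch in Python, not Studio itself: the four segments are made-up stand-ins for the ones in the screenshots, and `display_filter` is a hypothetical helper that keeps a segment whenever the pattern matches somewhere in it.

```python
import re

# Hypothetical segments standing in for the ones shown in the screenshots.
segments = [
    "multifarious activities kept the team busy",  # starts with the word
    "multifarious",                                # contains only the word
    "I enjoy work that is multifarious",           # ends with the word
    "nothing relevant here",                       # does not contain it
]

def display_filter(pattern, segs):
    """Keep only the segments in which the pattern matches somewhere."""
    return [s for s in segs if re.search(pattern, s)]

anywhere = display_filter(r"multifarious", segments)    # no anchors
at_start = display_filter(r"^multifarious", segments)   # caret: start only
at_end = display_filter(r"multifarious$", segments)     # dollar: end only
whole = display_filter(r"^multifarious$", segments)     # both: nothing else

for name, result in [("anywhere", anywhere), ("at_start", at_start),
                     ("at_end", at_end), ("whole", whole)]:
    print(name, result)
```

With both anchors in place only the segment that consists of the single word survives, which is the same narrowing effect as in the display filter.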

That was pretty easy… but what happens if I want to find something containing the caret symbol?  This is a good time to introduce the range of characters that have a special meaning and need to be escaped in the regex expression if you want to find the literal character in the text.  There are eleven characters that have a special meaning, and once you understand them regular expressions start to take on a whole new meaning… and an almost addictive quality (… or maybe that’s going a little too far ;-)):
^  $  [  \  .  |  ?  *  +  (  )
I’m not going to cover what all of these are for in this blog, but it’s important to understand that you have to escape them if you are looking for them as literal text.  To escape them you simply place a backslash (\) before the character in the regex expression.  So for example, if I wanted to find $2.45 in a list like this: $2.67, $2.72, $3.37, $2.45, $2.67, $2-45, $2.92, $2145, $2,45 and I just type in the literal characters, I get no match:

The first reason is that the dollar means anchor at the end of the string and then start looking.  Of course there is nothing after the end, so the search pattern can’t find anything.  So I have to escape this special character, and now I see this:

This time I have a few results, but still not what I need.  This is because the dot is a special character that means match any character at all, apart from a line feed character (or hard break).  So it matches the dot, the hyphen, the one and the comma.  So to be precise I need to escape the dot as well, like this:

You have probably seen these backslashes used a lot in regular expressions, and they can have the effect of confusing the heck out of you..!  But hopefully after reading this you’ll see it’s not so hard.  The simple sort of search operations you might do in Studio are still extremely powerful and useful, and they’re quite easy to learn as well.
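The three stages of the price search can be reproduced with the same list of values. This is a sketch using Python's `re` module rather than Studio, so treat it as an illustration of the escaping rules only; the variable names are just for this example.

```python
import re

text = "$2.67, $2.72, $3.37, $2.45, $2.67, $2-45, $2.92, $2145, $2,45"

# Nothing escaped: the dollar anchors at the end of the text, so "2.45"
# would have to appear after the end, and nothing is found.
unescaped = re.findall(r"$2.45", text)

# Dollar escaped: the unescaped dot still stands for any single character.
dollar_only = re.findall(r"\$2.45", text)

# Dollar and dot escaped: only the literal text $2.45 is found.
fully_escaped = re.findall(r"\$2\.45", text)

print(unescaped)      # []
print(dollar_only)    # ['$2.45', '$2-45', '$2145', '$2,45']
print(fully_escaped)  # ['$2.45']

# Python can also add the backslashes for you; this helper is specific to
# Python and is not something Studio offers.
print(re.escape("$2.45"))  # \$2\.45
```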
Another thing you might hear people referring to are “character classes/sets”.  These are used to match only one out of several options.  So for example, if you wanted to be able to find all segments containing the word localisation, or where it was spelled localization, you can do this using a single search containing a character set:
locali[sz]ation
So by putting square brackets around the letters s and z within the word, Studio knows it should look for words that contain either s or z at that position, but not both.  (Note that the opening square bracket is also a special character, so it needs to be escaped if you want to search for an opening square bracket as literal text.)
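A small sketch of the character set in action, again in Python rather than Studio. The sample segments are invented for the illustration; the last one contains both letters in a row to show that the set stands for exactly one character.

```python
import re

# Hypothetical segments; the last one is a deliberate typo with "sz".
segments = [
    "Software localisation takes planning",
    "We offer localization services",
    "The locale is set to en-GB",
    "localiszation is a typo",
]

pattern = re.compile(r"locali[sz]ation")
matches = [s for s in segments if pattern.search(s)]
print(matches)  # the first two segments only

# To find a literal opening square bracket, escape it with a backslash.
found_bracket = re.search(r"\[draft\]", "see [draft] notes") is not None
print(found_bracket)  # True
```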

A quick tip here is that Studio can be quite helpful when you are trying to get your regex expression perfected.  If you use the search facility rather than the display filter then the specific terms that are found are highlighted in the editor and this helps you to make sure all the things you want are picked up correctly:

You then hit the “Find Next” button and get this:

You can quite quickly see whether the search is going to find the correct things and then put it back into the display filter if you are trying to filter rather than find specific terms.
This ability to use a character class like this might be quite useful as a QA check for example (another place you can use regular expressions in Studio).  You can build up expressions of your own that check the document for commonly misspelled words, or simply to help ensure consistency throughout your document, before or after translation.  “Before” is quite useful as you have an opportunity to make sure the correct translation will be used where the misspelling could lead to subtle changes in the meaning:

Perhaps not the greatest example of a QA check, but I wanted to illustrate the possibility with the same expression we just learned… it would then show up in Studio as you worked like this so you knew there was a possible misspelling to be aware of through the small warning symbol that is shown in the status column:

Another useful thing about character sets is that they can contain ranges, which save you listing every character one by one.  So for example, if you wanted to find any number between 0 and 9 you can use this:
[0-9]
Or if you wanted to find letters, you can use these:
[a-z]
[A-Z]… or even
[a-zA-Z]
Where the first one looks for any lowercase letter from a to z.  The second looks for uppercase letters, and the third doesn’t care whether they are uppercase or lowercase.  You have the same concept for numbers like this:
[0-9] [2-5] [3-8]
Here the regex expression will try to match a single digit within the range specified, so 0 to 9, 2 to 5 or 3 to 8.  These two patterns for finding any number or letter are very useful and well worth becoming familiar with.  If you are looking for product codes, for example, that follow a particular pattern, then you can use these to build up the patterns.  For example I could do this:

In reality this is not a bulletproof regex because it will find anything that has a capital letter followed by a number.  But showing you this as an example also allows me to illustrate another useful point.  As a translator using Studio the phrase “economy of accuracy” is a good one to adopt.  We’re not writing computer code here that needs to be very specific… we are just trying to use regex to make our lives a little easier and find the things we need to address.  So in this example, based on all the other content of my document, this simple expression was enough to find exactly what I needed.  So why would I spend time trying to write something like this that would also achieve the same result?
[A-Z]\d{2}[A-Z]+\d{1}\.\d{1}
So, “economy of accuracy” is a good term to remember.  But also a word of caution… think of the consequences when being economic.  If you are filtering or just searching then this principle applies perfectly.  If you are replacing text throughout an entire document then the more accurate you can be the better… in fact when you are replacing you may have to be accurate depending on what you are doing.  But I’ll cover search and replace in part 3.
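To make the trade-off concrete, here is a sketch that runs a simple and a strict pattern over the same content. Everything in it is an assumption made for illustration: the product codes and segments are invented, the simple pattern `[A-Z][0-9]` is a guess at the kind of expression shown in the screenshot, and the strict pattern is the longer expression with `\d` for digits and an escaped dot.

```python
import re

# Invented segments; only the first two contain product codes.
document = [
    "Order part A12BC3.4 today",
    "Replace K07XY9.1 with the newer unit",
    "The weather is nice",
    "please confirm receipt",
]

simple = re.compile(r"[A-Z][0-9]")
strict = re.compile(r"[A-Z]\d{2}[A-Z]+\d{1}\.\d{1}")

simple_hits = [s for s in document if simple.search(s)]
strict_hits = [s for s in document if strict.search(s)]

# For this particular document the cheap pattern is good enough.
print(simple_hits == strict_hits)  # True

# The word of caution: different content can break the cheap pattern.
other_segment = "See Table B2 for details"
simple_false_positive = simple.search(other_segment) is not None
strict_false_positive = strict.search(other_segment) is not None
print(simple_false_positive, strict_false_positive)  # True False
```

For filtering, a false positive like the last one costs a glance; for search and replace it could change text you never meant to touch, which is why the stricter pattern earns its keep there.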
To finish this article off I want to mention another useful expression I used in the more complicated regex above:
\d{2}
The \d is another way of writing [0-9].  Technically they mean different things, but for our purposes they do the same thing.  The 2 in curly brackets means find a digit exactly 2 times in a row.  You can also use something like this:
\d{2,4}
This matches a digit between 2 and 4 times in a row.  Now these are really useful for handling number based patterns like IP addresses and dates that can be autolocalised by Studio when you don’t want them to be.  So the idea would be that you find the segments with these numbers in them, lock them and safely prevent them from being changed.  So consider the following where I used \d{3}\.

Clearly this regex isn’t accurate enough yet to serve my purpose, because I want to see only the segments that contain nothing but IP addresses.  But by inspecting the content of this file it looks as though I can resolve the problem with a little “economy of accuracy” by adding a caret to the start:

I can now easily lock all of these segments in one go and then translate without worrying about them being automatically changed incorrectly, for example because dots have become commas or spaces depending on the language I am translating into.  Certainly this is easier than using an expression like this, where the checks also validate the IP numbers themselves (as a translator or project manager you may not care whether they are valid or not..!):
^(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])$
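The three IP address patterns can be compared side by side in a short sketch. It is written in Python rather than run in Studio, and the segments are invented to show where each pattern draws the line; they are not the content of the file in the screenshots.

```python
import re

# Invented segments chosen to separate the three patterns.
segments = [
    "192.168.0.1",                               # plain IP address
    "Connect to 192.168.0.1 before continuing",  # IP address inside a sentence
    "999.168.0.1",                               # looks like an IP, not valid
    "10.0.0.1",                                  # valid, but starts with 2 digits
    "Version 2 of the manual",                   # no IP address at all
]

loose = re.compile(r"\d{3}\.")       # three digits and a dot, anywhere
anchored = re.compile(r"^\d{3}\.")   # the same, at the start of the segment only
octet = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
strict = re.compile(r"^(?:" + octet + r"\.){3}" + octet + r"$")  # validating

def keep(pattern):
    """Return the segments the pattern finds a match in."""
    return [s for s in segments if pattern.search(s)]

loose_hits = keep(loose)
anchored_hits = keep(anchored)
strict_hits = keep(strict)

print(loose_hits)     # segments 1, 2 and 3
print(anchored_hits)  # segments 1 and 3
print(strict_hits)    # segments 1 and 4
```

The anchored pattern accepts the invalid 999.168.0.1 and overlooks 10.0.0.1, so it only does the job when, as in the file described above, inspecting the content shows that neither case matters.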
There are many more things to learn about using regular expressions in Studio, and if you do find them interesting and have other applications where you can use them then there are many more things to learn generally..!  But I hope I have managed to achieve my main objective of removing the wizardry a little, explaining “economy of accuracy” and how this principle will allow you to have a go with regex in Studio yourself and become more productive as a result.
There are a few other articles that are worth reviewing:
Regex….. what regex..! (Regular Expressions – Part 1)
Search and replace with Regex in Studio – Regular Expressions Part 3
DOGS and CATS… Regular Expressions Part 4!

22 thoughts on “Regex… and "economy of accuracy" (Regular Expressions – Part 2)

  1. Thanks for this article, Paul. It couldn’t be clearer, especially with your example sentences to explain the anchors. Practising is the key to getting to grips with Regex, like learning any language, I suppose.
    Emma

    1. Thanks Emma. You’re right… I’m no expert myself but it’s amazing what you can accomplish with just a few simple rules and a little practice.

  2. Hi Paul, great series so far on regex usage in Studio. One question though: Why does the validating regex for IP addresses use non-capturing groups? Is that needed in Studio, a performance improvement or something else?
    Also, I’d recommend the following version of that regex for simplification:
    ^(?:(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])$
    The alternate that matches 100 to 199 can be written as 1[0-9]{2} instead of 1[0-9][0-9], makes it a little more obvious as to what the intent is.
    Finally, whilst you’re obviously a keen user of RegexBuddy, I tend to go with RegexHero (http://regexhero.net/tester/), a Silverlight-based .NET regex designer. It’s always been a little easier to get to grips with than RegexBuddy for me. Admittedly, it’s a nag-ware app (it pops up every few minutes to ask you to buy it) but if you aren’t using it for very long, it’ll do OK.
    Rob

    1. Hi Rob, I guess there are lots of ways to validate an IP address, so nice to see your suggestion. The first non-capturing group is there so you can take what’s in the group and apply it three times… but you’re right in that I could have omitted the second one and it’ll still work. I don’t have a file large enough and full of IP addresses to know whether one performs better than the other… in fact the one I used may be slower with the additional task to perform. But actually my use of regex is all about “not” having to get this technical..! So \d{3}\. serves its purpose 🙂
      I had a quick look at regexhero… looks handy and it’s free. So thank you for the reference on this page. I think for me, as someone who is not as expert as you, regexbuddy may be better, particularly because of its ability to break down more complicated expressions into understandable English..!
      Cheers
      Paul

    1. Glad you think so Tommy! If you use SDLTMConvert from the OpenExchange and convert the SDLTM to XLIFF then yes… but you can’t use these in the TM Maintenance editor which is what you probably mean.

  3. Hi, Paul,
    Thanks for the nice write-up on using finite automata in SDL Trados Studio. Do you by chance know how to use regular expressions in the replace string in Passolo?
    We need to find an apostrophe in a Unicode alphanumeric string and add an extra backslash before the apostrophe (this is necessary to escape the apostrophe symbol for that text).
    Example:
    The source string would be: TecтTest’ TestТест
    Goal: TecтTest\’ TestТест
    Obviously, the find expression should look like a combination of three classes:
    ([\p{L} \p{Nd}])(‘)([\p{L} \p{Nd}])
    where
    ([\p{L} \p{Nd}]) — 1st and 3rd class of characters that includes all alphanumeric characters.
    (‘) — the 2nd class that includes an apostrophe character only.
    So that the replacement expression would be: $1\$2$3
    where $n is nth class of characters.
    Running a search’n’replace procedure in a special tool would give us: TecтTest\’ TestТест
    So here goes the issue with Passolo. In Passolo it is NOT supported (to my knowledge) to use regular expressions in replace string. The application searches for the string using regular expression and replaces it with a flat sequence of characters. For a given example we’d get:
    TecтTes$1\$2$3 estТест
    What it does, it cuts off by 1 from the neighboring characters with the apostrophe and replaces the sequence with $1$2$3!
    Indeed instead of interpreting the replacement string Passolo treats it just as a text. What a whacky behavior!
    Any clue on fixing that? How would it be possible to do such a replacement in Passolo?
    Thank you.

    1. @Exotic Hadron: Some answers and additional tips:
      I was not able to reproduce your problem. Maybe there are some differences in what apostrophe is in your text and expression and what is displayed in the web site of the blog. SDL Passolo IS supporting back references in the replace expression. I did a simple test replacing
      (]+)(>)
      with
      $1NEW$2NEW$3
      and it works without problems as expected without inserting any unexpected literal text from the expression into the translation.
      Maybe you can simplify your expression so that is only exchanges the apostrophe. Another solution depends on the file format. If you are localizing text files you can simply avoid the problem by inserting a mapping into the parser rule that automatically exchanges ‘ with ’ when the target file is generated.

      1. The blog is eating my expression. I’m missing a preview button to check what will be displayed. I’m inserting some entities. Please convert them back to characters:
        (<)([^>]+)(>)

        1. Hello, Achim,
          Thank you for your response here and the one over at Proz.
          Well, the main issue there is that in your example you, admittedly, have a simple automata that is seeking for ASCII characters that match tags. I don’t know if it is the case but you’d agree, escaping within a Unicode literal text *could* be much quirkier for Passolo than just adding a class+literal what you do in your example. Your expression is way simpler to interpret, so that possibly could be the issue, but does not necessary is.
          At least we’ve experienced issues with using Unicode expressions where ASCII reexps seemed to have worked flawlessly.
          The good news is that we’ve re-run our tests today (step by step), and you know what? It worked. We’d possibly file an issue case to SDL if we face the issue with non-working Replace features further in.
          Thank you for your support and help.

  4. A suggestion I have found useful in the past when working with regular expression: keep a text file with a collection of useful regular expression searches. I keep a file with the regex expression, followed by a description of what the expression does; this is for not reinventing the wheel every time I have to do a moderately complicated regex search.

    1. Good idea… I do the same thing. I have a spreadsheet with different tabs for the different usecases. So I have tabs for display filter, normal search, xml filetypes, embedded xml/html, XPath, regex filetypes, QA rules, SQL statements (for SDLTmConvert) and Terminjector rules. Very handy tip! Then I also keep it backed up regularly as I’d hate to lose it 😉

  5. For e.g.:
    <trans-unit id="19.30-attr-href"
    <trans-unit id="27.37-attr-hre"
    Is there a way to express the above digits to regx?

    1. I need a little more context here. Have you extracted this attribute for translation in an XML file and now want to handle the number in some way?

  6. I want to create inline tags for some CAD files extracted to text using TranslateCAD and I want to prevent translation of any string that doesn’t have letters in. So, for example, strings like 95.937 or 3939. or 15-1455.
    I have tried ^(?=[^A-Za-z]+$).*[0-9].*$ which I found on another website but Studio didn’t allow the tags.

    1. Hello Malcolm, regex should really be specific to the problem you want to solve rather than look for solutions on a website… probably easier this way too! Maybe something like this will do the trick:
      [^\sa-zA-Z]+
      This will basically look for anything that is not a space or a lowercase or uppercase letter.
      Your expression will specifically start at the beginning of the segment and look for letters and numbers until the end… so is definitely not what you are after. I’d recommend you take a little time to learn a few simple basics… it’s a lot less complicated than you might think.

  7. Hi, I’m trying to search for this:
    \([a-z]
    (an opening bracket followed by a lowercase letter), but the results contain brackets followed by uppercase letters too.
    What have I done wrong?
    Thanks!

    1. Try checking the “match case” box as the normal search has this option which I think may override the regex. I can’t test this at the moment so I’m working from memory… but it may be something like that.

  8. This was an excellent introduction to regular expressions, thanks Paul. I understood and remember the syntax better than with other explanations thanks to the practical examples in Studio and RegexBuddy.

    1. Excellent… glad you found it useful. We have a forum dedicated to regex and XPath in the SDL Community. Worth a look if you get stuck or just want to see what others are doing.
