Regular Expressions – Part 1

Regular Expressions, often referred to as Regex, come up again and again in forums, at roadshows and in the occasional question. So I thought it might be useful to take a better look at them and how they can be useful for translators. To begin with I’m republishing a blog article I wrote a year or so ago on a different site so I can build on this theme in one location.

Regex….. what regex..!

What regex… this one of course! You can resolve your problem easily with the following expression;

\b(0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])[- /.](19|20)?[0-9]{2}\b

Now, I know most of you will instantly recognise this as an expression to find dates written in month/day/year order (12/31/2019 or 1-5-20 for example). But for those of you who don’t, and who have heard the phrase “Just use this Regex Expression.”, let’s take a closer look at how they can be useful.
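
To see what the expression actually picks up, here is a minimal sketch in Python. Python's re module is not the .NET engine that Studio uses, but this pattern only relies on syntax the two have in common, so it should behave the same way. The helper name find_dates and the sample sentences are made up for illustration.

```python
import re

# The date expression from above, unchanged.
DATE = re.compile(
    r"\b(0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])[- /.](19|20)?[0-9]{2}\b"
)

def find_dates(text):
    """Return every substring of text that the expression matches."""
    return [m.group(0) for m in DATE.finditer(text)]

# Month 1-12, day 1-31, a two or four digit year, separated by - / . or space.
print(find_dates("Ship 12/31/2019, return by 1-5-20."))
# A month of 13 and a day of 45 fall outside the allowed ranges.
print(find_dates("Invoice 13/45/2020"))
```

The first call finds both dates and the second finds nothing, which is a quick way of checking what the month and day ranges inside the brackets really allow.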

Regex is shorthand for Regular Expression and regex is just a set of rules that describes a pattern of text so that you can find that text in your documents or in your translation. I’m not intending this to be a tutorial in how to write regex expressions but I would like to describe where you can use them in our products and where you can get help. I’m very fortunate in that I have a “Patrik” and many other clever developers who can write these expressions standing on their heads, but when they are fed up with me asking the same question over and over again I refer to RegexBuddy. There are probably lots of good tools and reference sites for regex and I’m not promoting this one for any reason other than I find it easy to use. This application is really very useful because it can help you create expressions, test them out, and explain what each part of the expression is doing. It also has a library of expressions (the one at the top came from there) you can draw upon as well as the ability to get help from some real experts in a forum that works from within the application itself.

Regular Expressions can be used in all of SDL Language Technologies products so learning even the simplest of expressions can be really useful. For example, I have a document containing many number-only segments and I only want to see the text that needs to be translated because Studio automatically translates the numbers for me. I can use the following regex in the display filter to remove all the number-only segments from view;

[^0-9.,]

If I enter this into the display filter like this;

Then this document transforms as follows;

The question is, how would I know what [^0-9.,] means in the first place? This is where RegexBuddy comes in. First I can ask it to insert a token (the term used to describe the regex operators) to match what I need, in this case I don’t want it to match numbers. So the first part is this;

Then I specify what should not match in the same window. In this case digits between 0 and 9, and decimal points and commas;

This creates the expression to use for me, and also gives me a handy explanation of what this all does. This is really helpful because the more I use it, the more I start to understand how this works..!
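
The same idea can be tried outside Studio. This is only a rough sketch in Python, and it assumes the display filter keeps a segment in view whenever the expression finds a match somewhere in it. The function name visible_segments and the sample segments are invented for the example.

```python
import re

# Matches any single character that is NOT a digit, a full stop or a comma.
NOT_NUMERIC = re.compile(r"[^0-9.,]")

def visible_segments(segments):
    """Keep only the segments that contain at least one non-numeric character."""
    return [s for s in segments if NOT_NUMERIC.search(s)]

segments = ["1,250.00", "Connection on", "42", "3 items", "10.5"]
print(visible_segments(segments))
```

Segments made up purely of digits, full stops and commas give the expression nothing to match, so they drop out, while a segment such as "3 items" stays because of the space and the letters.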

Let’s look at a slightly trickier example. I have a file that requires translation and all the translatable text is between quotes like this;

$AB1 = "Connection on"
$AB2 = "Connection off"
$FWL = "Firewall"
$PWH = "Always on"
$TestPlat = "Safe Environment"
$Watcher = "Network monitor"
$Source = "Source:"
$Target = "Target:"
$Protocol = "Protocol:"
$OtherNet = "Other networks"
$BrowseForExec = "Executable group selection"

I can create a simple regex filetype in Studio to pick this out by telling it to look for a simple pattern that finds all text in each row that sits between the quote marks. In Studio it would look like this;

Resulting in this nice and clean rendition for translation with no danger of incorrectly changing any of the code (copy and paste the code above into a txt file and give it a try);

In RegexBuddy it was easy to test and create like this. First the Opening Pattern;

^
This just tells Studio to start searching at the beginning of the string of text

.
The dot is the pattern that regex knows matches any character apart from line breaks

*
The star tells the regex engine to keep looking for any character (remember the dot) as many times as needed

?
The question mark is there to stop the star from continuing to search after it’s found the first result (referred to as lazy instead of greedy)

(")
The brackets tell Studio to find the quote symbol and then remember that it found it (a capturing group, which can be referred to later with a back reference)
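
Put together, the five pieces make the opening pattern ^.*?("). The effect of the question mark is easiest to see by running the pattern with and without it. This is a small sketch in Python rather than .NET, using one line from the sample file; the variable names are just for illustration.

```python
import re

line = '$AB1 = "Connection on"'

# Lazy: .*? gives up as soon as the first quote can be matched.
lazy = re.match(r'^.*?(")', line)
# Greedy: .* grabs as much as it can and only backs off to the last quote.
greedy = re.match(r'^.*(")', line)

print(repr(lazy.group(0)))
print(repr(greedy.group(0)))
```

The lazy version stops straight after the first quote mark, leaving the translatable text untouched. The greedy version swallows the whole line, which is exactly what the opening pattern must not do.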

I can test this in RegexBuddy like this;

Note: Something I missed out from the original article.  This is the importance of making sure you select the correct type of Regex.  There are many forms and Studio (as this is what I’m showing here) uses .NET.  The great thing about RegexBuddy is it supports them all, so once you get the hang of using Regular Expressions you’ll find other applications where they’ll be useful too.

And of course the explanation like this;

Next the Closing Pattern;

(")$

Here we know what the (") is for so the new part is the $ which just means the match has to sit at the end of the string. In RegexBuddy this tests like this;
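
For anyone without RegexBuddy, the two patterns can be tried together in a few lines of Python. This is only an approximation of what the filetype does: it assumes the translatable text is whatever sits between the end of the opening match and the start of the closing match on the same line. The function name extract_translatable is hypothetical and not anything Studio exposes.

```python
import re

OPENING = re.compile(r'^.*?(")')  # everything up to and including the first quote
CLOSING = re.compile(r'(")$')     # the quote at the very end of the line

def extract_translatable(line):
    """Return the text between the opening and closing matches, or None."""
    opening = OPENING.search(line)
    if not opening:
        return None
    # Look for the closing pattern only after the opening match has ended.
    closing = CLOSING.search(line, opening.end())
    if not closing:
        return None
    return line[opening.end():closing.start()]

sample = [
    '$AB1 = "Connection on"',
    '$Source = "Source:"',
    '# a line with no quotes at all',
]
for line in sample:
    print(extract_translatable(line))
```

Lines that do not contain both patterns return None here, which mirrors the idea that anything outside an opening and closing pair is left alone as code.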

So, enough with the examples already… where is all this applicable in Studio for example?

  • QA checker
  • Regex filter (or txt filter)
  • embedded html content inside an xml file
  • specifying filetypes that should use a particular filter
  • inline and external tags in various filetypes
  • segmentation rules
  • filtering segments in the editor view
  • searching in the editor view
  • search and replace in the editor view
  • display filter in the editor view

Every product we have, SDL Author Assistant, SDL MultiTerm, SDLX, SDL Trados 2007, SDL Trados Studio etc. all have features that can greatly enhance the usability of the products if you are able to use a little regex. So have a play and you’ll find that you can do some quite powerful things with just a little basic knowledge. I’d also be interested to see your examples of what you have done, so feel free to post a comment here and perhaps we can create a library of useful expressions for real life translation work.

Also see the following articles;

Regex… and “economy of accuracy” (Regular Expressions – Part 2)

Search and replace with Regex in Studio – Regular Expressions Part 3

DOGS and CATS… Regular Expressions Part 4!

16 comments
  1. Miran said:

    You wrote: “Every product we have, SDL Author Assistant, SDL MultiTerm, SDLX, SDL Trados 2007, SDL Trados Studio etc. all have features that can greatly enhance the usability of the products if you are able to use a little regex.”

    How do I use regular expressions in SDL Trados 2007? I tried to use regex in Concordance and Maintenance windows – but I did not get the response I expected.

    Thanks in advance for your answer.

    Miran


    • Hi Miran,

      I think the key part to my sentence was this part “…all have features …”. So Studio can use regex in many more places than 2007 for example. I believe 2007 only features the use of regex as part of the Snippet Mark-up plug-in for picking out and marking up non-translatable text and in the QA Checker for creating pattern based rules.

      So as far as I am aware there are no features in Trados 2007 for using regex anywhere else.

      Regards

      Paul


  2. Kevin said:

    Great blog!

    I’m wondering if Trados regex handling of multilines and line feeds is explained in more detail somewhere? I can get multiline regex expressions working in Expresso but when I copy them to Trados they don’t work. I have tried with/without the m modifier in the regex and with/without the Trados multiline option. I’ve also tried using the complete string parameter \Z instead of \n as the closing pattern to read the entire string but that didn’t work either.

    This is what I’m trying to import (.sbv format), there can be one or two lines of text per segment plus timestamp and blank line.

    0:02:28.250,0:02:30.417
    This segment is on one line

    0:02:31.709,0:02:35.542
    This segment is spread
    over two lines

    0:02:35.792,0:02:37.626
    Another one line segment

    Sample regex expression to match the lines (working in Expresso):

    (?m:^\d+?:\d+?:\d+?\.\d+?,\d+?:\d+?:\d+?\.\d\d\d)(.*\n)(.*\n)(.*\n)(.*\n)

    Thanks!


    • Try this.

      Opening Pattern:
      \d+:\d+:\d+\.\d+,\d+:\d+:\d+\.\d+

      Closing Pattern:
      \n(?=\d)

      Multiline checked.

      That works for me.
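
A rough way to check this pair of patterns outside Studio is the Python sketch below. It assumes the translatable text is whatever lies between an opening match and the next closing match, and that the last block simply runs to the end of the file; that last point is a guess about how the filetype behaves, not something confirmed here. The function name extract_segments is made up.

```python
import re

OPENING = re.compile(r"\d+:\d+:\d+\.\d+,\d+:\d+:\d+\.\d+")  # the timestamp line
CLOSING = re.compile(r"\n(?=\d)")  # a line break directly followed by a digit

def extract_segments(text):
    """Collect the text between each opening match and the next closing match."""
    segments = []
    pos = 0
    while True:
        opening = OPENING.search(text, pos)
        if not opening:
            break
        closing = CLOSING.search(text, opening.end())
        # Assumption: with no closing match left, take the rest of the file.
        end = closing.start() if closing else len(text)
        segments.append(text[opening.end():end].strip())
        if not closing:
            break
        pos = closing.end()
    return segments

sbv = (
    "0:02:28.250,0:02:30.417\nThis segment is on one line\n\n"
    "0:02:31.709,0:02:35.542\nThis segment is spread\nover two lines\n\n"
    "0:02:35.792,0:02:37.626\nAnother one line segment\n"
)
for segment in extract_segments(sbv):
    print(repr(segment))
```

Because the closing pattern only fires on a line break that is followed by a digit, a subtitle line that itself starts with a number would end the segment early, so it is worth checking the file for those.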


      • Kevin said:

        Awesome! Thanks. When it works, the regex import option can save a huge amount of file prep.

        Is there any documentation available comparing the differences of the Trados implementation of regex against a standard regex implementation?

        I’m still having some problems understanding the differences caused by the division of the expression in Trados into opening and closing patterns against a single standard regex expression and also the use of additional patterns for multilines.


      • What’s a standard regex implementation? Studio uses .NET which is “a” standard. I think your problem is that you didn’t define opening and closing patterns and for this filetype the document structure is defined by finding the characters that should be outside the translatable area. Your regex finds everything from the start of the opening pattern, includes the translatable text (so in effect excluding it from translation), and then captures the start of the next opening pattern as well.
        So the opening pattern would be everything before the translatable text, and the closing pattern would be everything after it.


    • Kevin said:

      I understand that Trados uses the .NET regex implementation but the actual regex integration with Trados has its own quirks. For example:

      1. Trados divides the regex into opening and closing patterns – I haven’t seen this in any other regex tutorials, examples or tools (Expresso, etc.). Why does it do this and what are the differences with defining a match in a single expression (what I would call a “standard implementation” based on everything else I have read online)?
      2. Multiline: Trados uses a ‘Multiline’ parameter in the definition dialog, does this replace the m character? How does it handle line feeds other than \n ? Should the m parameter still work in spite of the dialog ‘Multiline’ option?

      Your tutorials are great and really helpful but some Trados specific context would prevent lost time for users that are new to this way of doing things.

      PS. Regarding the expression, you effectively explain what I was trying to do. I wanted to exclude the timestamp and include the following lines in the match.


      • It only does this with the text based filetype and this is to allow you to easily define what information should be excluded from the translatable text at a structural level. You can also specify inline tags and here you could have a single rule to select something you wanted to convert to a protected tag; but you could also have an opening and closing tag which would require an expression for each.
        I think the thing to get used to is that regex is only a tool for selecting something. It would be much more complicated to write a single expression to pick out the translatable text and exclude what came first and what came after that text. Breaking it up into simple chunks simplifies the expression you need.
        The tools you use to create the expressions are irrelevant here. You just need to understand what it is you are trying to write the expression for. So in this case for example it says in the filetype itself “Specify opening and closing regular expressions to define the start and end points of translatable text. Text between the matches for the opening and closing patterns will be extracted for translation. An opening and closing pair must occur on the same line unless the rule is designated as multiline.” I think that’s fairly explanatory, and even in Expresso you would create a similar expression to find the opening and closing patterns.


  3. caroline said:

    Hello Paul,
    thank you for all your help on different pages. Now, I’m trying to work with Regex Match AutoSuggest Provider. What I need to modify automatically is this:
    PD-C-Serie
    and I would like it to be managed to get
    Série PD-C
    Now, where is my mistake? I entered the following orders:
    Regex Pattern: (w+-\w+-)Serie
    Replace Pattern: Série $1

    Many thanks indeed.


    • I wonder why you don’t just search for the literal text if this is all you want? No need for a regex. Or use a termbase entry?


  4. caroline said:

    Hi,
    I don’t know whether my previous answer arrived. I’m looking for an automated solution because I’ve got quite a lot of such segments in a translation (i.e. “XX-X-Serie” or “XX-Serie”).
    Maybe I could work with search/replace, but there again, I’m not able to write the proper formula (look for “AB-C-Serie”, replace with “Série AB-C”). I had a look at an example of yours for article numbers and thought it might be similar here.

    Any help would be welcome.


    • ok – try this then:

      Search for:
      ([\w-]+)-Serie

      Replace with:
      Série $1

      Your expression was almost correct for one of your cases… if you’d written this it would have worked:

      (\w+-\w+)-Serie

      You just forgot to escape the w at the start and then included the last hyphen in the back reference which you don’t want.

      The example I gave you catches “XX-X-Serie” or “XX-Serie” but you can always write multiple expressions as this is often easier and less prone to error.
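
The search and replace pair can be tested outside Studio with a short Python sketch. One difference to watch for: Python writes the back reference in the replacement as \1 where .NET, and so Studio, uses $1. The function name to_french is only for illustration.

```python
import re

SERIE = re.compile(r"([\w-]+)-Serie")

def to_french(text):
    """Move the model code behind the word and add the accent."""
    # \1 is Python's way of writing what Studio calls $1.
    return SERIE.sub(r"Série \1", text)

print(to_french("PD-C-Serie"))
print(to_french("XX-Serie"))

# The original attempt has a literal w at the start, so it finds nothing here.
print(re.search(r"(w+-\w+-)Serie", "PD-C-Serie"))
```

The last line prints None because the unescaped w only matches an actual letter w, which is why the first attempt never fired on these segments.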


  5. caroline said:

    Thanks ever so much!!!


  6. Marc said:

    Great blog! Thanks! But I’m stuck. I tried the example about translating just the text between quotes in a MS Word document and exclude the rest. I can’t find in SDL Studio where I can insert my regex syntax for MS Word document. The only thing I saw was under Options-File types-Embedded Content Processor-Plain text-Document Structure. I put my syntax there correctly (I know that syntax is good), but still to no avail. And how do you run that regex code? Once my translation text is in the editor, do I have to click on a button to run my regex code or it is done as soon as you open the project?
    Thanks!


    • Hi Marc, you can’t handle the text this way with a Word document I’m afraid. If you want to do this you have two choices. First, copy the contents, if it’s appropriate, into a text file and handle it as per the blog; or second, give the text you don’t want to translate in Word a non-translatable style and add that style to the Word options.

