I’m pretty sure that when we started to build the new Customer Experience Team in Cluj last year there was nothing in the job description about being competitive… but wow, they are!!!  I’d be lying if I said I wasn’t competitive, because I know I am, but it’s been a long time since I’ve had these kinds of feelings that keep me up at night.

To some extent I think the training requirements at SDL are the perfect fuel for this type of environment and I haven’t made up my mind yet whether it’s healthy or not.  But in their roles the team speak with customers through the online chat, in the community, via email… basically anywhere anyone comes in with a question because they don’t have a support contract or an account manager to ask and they didn’t know about the SDL Community which is of course the best place to go for help.  To be able to answer the variety of technical questions we see, all the team have either completed or are working through the various SDL Certifications available at a rate of knots and are learning more about the sort of problems faced by translators and project managers just by having to help people every day.  They are doing a fantastic job!

So where am I going with this introduction?  Well, sometimes we see translators asking questions about how to use regular expressions, so I spent a little time introducing them to the team.  I think a small amount of knowledge about regular expressions allows you to reap huge benefits in many places in our products as well as in other applications you probably use like Notepad++ or Microsoft Word for example.  I’ve written about regular expressions in the past and have a few articles you can find using this link, or this one on Word.  But to really make it easy to get started, and fun, I introduced the team to the Regex Crossword which provides some basic tutorials and a fun way to start getting used to using regular expressions.  In no time at all you’ll know enough to do some really powerful things in Studio to get a little competitive edge and this is really what I want to talk about… where in the products can you use regular expressions and why?

The article itself is quite long, but I hope it’s useful and most of all if you have not ventured into the world of regular expressions I hope it encourages you to have a go and learn just a little.  I promise you that with just a little basic knowledge you can achieve a lot without breaking anything using the search and filtering capabilities.  I also promise you that you can break a lot using the replace capabilities… so take care with anything that replaces and make sure you have a backup first!!

Using regular expressions in Studio

Search & Replace in the Studio Editor

Let’s say you translated a file containing lots of numbers using the Euro symbol and when you’d finished you realised that the translation should be written as 12,50 € as opposed to € 12,50. Do you translate them all again?  Of course not, you just activate the use of regular expressions in the Search & Replace dialogue when the file is open for translation and search for something like this:

€\s(\d+,\d{2})

\s means a whitespace character, such as a space
\d means a single digit from 0-9
+ means find whatever you were looking for at least once, but keep looking until you find no more
{2} means find exactly 2 of whatever you were looking for
() putting round brackets around the expression means remember whatever you found inside these brackets

So fairly simple stuff, but very powerful when used in the right context.

Then replace whatever you found with this:

$1 €

$1 is the expression you use to recall whatever you remembered inside the brackets.  It’s called a back reference.
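If you want to try the same search and replace outside Studio first, here is a minimal sketch in Python.  It is only an illustration and not how Studio does it: Studio uses the .NET regex engine, and Python writes the back reference as \1 where Studio uses $1.  The function name and the sample sentence are made up for the example.

```python
import re

# Same search pattern as above: Euro sign, whitespace, then the amount.
SEARCH = r"€\s(\d+,\d{2})"
# Python writes the back reference as \1 where Studio (.NET) uses $1.
REPLACE = r"\1 €"


def move_euro_symbol(text):
    """Rewrite every '€ 12,50' style amount as '12,50 €'."""
    return re.sub(SEARCH, REPLACE, text)


# Hypothetical sample sentence, only here to show the effect.
print(move_euro_symbol("Der Preis beträgt € 12,50 pro Stück."))
```

Anything that doesn’t fit the pattern, for example an amount with no space after the Euro symbol, is left untouched.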

You can activate the Search & Replace with a keyboard shortcut, Ctrl+H, and set up your search & replace like this:

I have written about the Search & Replace capabilities using Regular Expressions in the past, so maybe this is worth a read too.

Search & Replace using Apps

The same principle can be used in many applications from the SDL AppStore.  For example, if you had the same situation and wanted to tackle this slightly differently with two processes not possible out of the box in Studio:

  1. Change the numbers in the source with a Search & Replace
  2. Search & Replace across 500 files in one go without opening them

A useful application for this is the SDLXLIFF Toolkit.  Here you can drag and drop the SDLXLIFF files, or just the SDLPROJ file to achieve the same thing, and apply the Search & Replace like this:

The article referenced will tell you more about what’s possible with this app but the ability to make changes in the source, across all your files at once, and see the effect of the changes before you actually apply them are great features.

Quality Assurance

A feature that isn’t used often enough, possibly because of a perceived lack of capability, or maybe complexity, is the ability to create your own QA rules using regular expressions.  I’ve touched on QA before, and I focused on the use of regular expressions, so it’s perhaps worth a read.  But just on this simple problem of the Euro, you could easily use this QA feature to find any translations not correctly transposed in the first place.

It’s exactly the same rule, just looking in the target segment for any matching pattern, and if one is found you’ll know about it immediately as you work:

This is obviously a simple example, but the more knowledge you have about how regular expressions work the more complex scenarios you can handle using the built in QA checks in Studio.
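To make the idea of a regex QA rule a little more concrete, here is a rough sketch in Python of what such a check boils down to.  This is not the Studio QA Checker, just the same pattern run over a list of target segments; the function name and the sample segments are assumptions for the example.

```python
import re

# The pattern that should NOT appear in a finished target segment.
WRONG_EURO_FORMAT = re.compile(r"€\s(\d+,\d{2})")


def find_untransposed(targets):
    """Return (segment number, offending text) for each match in the targets."""
    issues = []
    for number, target in enumerate(targets, start=1):
        for match in WRONG_EURO_FORMAT.finditer(target):
            issues.append((number, match.group(0)))
    return issues


# Hypothetical target segments: the second one was not transposed.
for number, text in find_untransposed(["Preis: 12,50 €", "Preis: € 8,00"]):
    print(f"Segment {number}: found '{text}'")
```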

Filtering in the display filters

There are actually three useful display filters available for Studio (two out of the box and one in the SDL AppStore) and they can all use regular expressions:

  1. The Display Filter in the Review tab
  2. The Advanced Display Filter in the View tab
  3. The Community Advanced Display Filter from the SDL AppStore

All of them give you an ability to select only certain segments at a time and do something with them.  For example you might have segments containing only numbers and these could be handled really quickly using any of these tools by selecting them, copying source to target, manipulating them with a regex search & replace perhaps and then changing the status to translated and locking them.  For some files the ability to run a process like this can be a game changer because you no longer have to be focused on the mind-numbingly boring task of making sure that each number is being handled correctly by Studio as you translate.

If we come back to our simple Euro example where we happen to have sentences that contain only a number like this then we can’t use the built in Number Only option in the display filter in the Review tab because of the Euro symbol, but we could use a simple regular expression.  In fact we can use the same expression we have been using all along with the addition of a couple of anchors.

^€\s(\d+,\d{2})$

The caret, or circumflex, symbol at the start of this expression means only find matches where the first character in the string is a Euro.  The dollar symbol at the end of the expression means only find matches that don’t have any more text after the numbers, so they end at this point.  Basically this forces segments that are our Euro values on their own to be displayed.  I can put this expression in here as this filter assumes regex by default:
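The effect of the two anchors is easy to see if you run the expression over a handful of segments.  Here is a small sketch in Python (not the Studio display filter itself, and the sample segments are invented) that keeps only the segments consisting of a Euro value and nothing else:

```python
import re

# The same expression with the two anchors added.
EURO_ONLY = re.compile(r"^€\s(\d+,\d{2})$")


def euro_only_segments(segments):
    """Keep only the segments that are a Euro value on their own."""
    return [segment for segment in segments if EURO_ONLY.search(segment)]


# Hypothetical segments: only the first and last are a Euro value on their own.
samples = ["€ 12,50", "Price: € 12,50", "€ 12,50 per unit", "€ 7,00"]
print(euro_only_segments(samples))
```

Without the anchors all four sample segments would be displayed, because each of them contains the pattern somewhere.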

The two Advanced Display Filters will get you the very same results used like this… just remember to check the Regular Expression check box or the search will be handled as plain text:

The use of regular expressions to filter segments is very powerful, and although I always stress the idea of keeping it simple, there are times when it can be quite tricky to filter on what you actually need.  Sometimes you can find what you don’t want more easily, and this is where the Community Advanced Display Filter is brilliant because of this option alone:

So you search for what you don’t want and then just click on Reverse Filter to get what you do.  That is easily my favourite feature in the Community Advanced Display Filter… so obvious and yet so powerful!
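The reverse idea is simple enough to sketch in a few lines of Python.  This is only an illustration of the concept, not the app’s code, and the reverse parameter is just my name for the Reverse Filter option.  Here it is easier to describe what I don’t want (any segment containing a letter) than what I do want:

```python
import re


def filter_segments(segments, pattern, reverse=False):
    """Keep segments that match the pattern, or those that don't if reverse is True."""
    regex = re.compile(pattern)
    return [s for s in segments if bool(regex.search(s)) != reverse]


# Hypothetical segments: search for letters, then reverse the filter.
samples = ["€ 12,50", "Click Save", "1234", "See page 5"]
print(filter_segments(samples, r"[A-Za-z]", reverse=True))
```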

Regex AutoSuggest Provider

I can’t talk about the use of regular expressions in Studio without covering off the superb Regex AutoSuggest Provider.  Nora Diaz wrote a nice article about this so it’s worth referring to that for more details since the explanations are really based around why this is a useful tool for translators.  But in a nutshell the application allows you to transpose the way text is written as you type and insert it through AutoSuggest.

So let’s go back to our simple Euro example where we can add this rule in the Regex AutoSuggest Provider:

Exactly the same expressions we have been using and in effect it’s like a search & replace on the fly.  So as soon as I start to type the matching replace pattern starting with a number I see this where I can simply hit the return key and the number will be inserted exactly as I need it:

This tool actually enhances the concept of regex in Studio quite a bit through its use of variables that allow you to handle lists of search & replace as you type with a single search and replace pattern… a very smart application.

Segmentation Rules

This is all about preparing a file for translation before you start to translate so you can avoid the need for having to split (usually split… could be merge too) sentences manually as you go.  For example, if you received a file for translation that contained lists of SEO (Search Engine Optimization) content then you might find they are unique words and phrases all separated by a comma depending on how the lists were generated.  This would be tricky to translate unless you split the sentences to ensure that each keyword was on a separate line… well not tricky to translate, but long meaningless sentences and very little reuse of the translated material afterwards through your Translation Memory.  If you create a segmentation rule on your Translation Memory then a file containing this kind of content could be split before you even start:

Now this is a very simple example which can even be created in the Studio interface without knowing any regular expressions, but it’s the concept I want to get across.  Before the break on the left I used one expression:

.[,]+

Match any character (the dot) followed by one or more commas (the comma in square brackets with a plus) and then break.  After the break there can be anything (the dot again).  The concept being if you have a pattern in the text which could be used to identify where to split the segments then splitting them as you prepare the files for translation is the best way as you’ll have no work to do on the file in Studio as you work.  So that simple regular expression rule can have this sort of effect when you open the file for translation:
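If it helps to see the before break and after break idea in action, here is a rough sketch in Python.  It is not Studio’s segmentation engine, just an imitation of the concept: wherever the before break pattern is followed by something that matches the after break pattern, the text is cut.  The function name and the keyword list are made up for the example.

```python
import re

BEFORE_BREAK = r".[,]+"  # any character followed by one or more commas
AFTER_BREAK = r"."       # anything at all after the break


def split_segments(text):
    """Cut the text after each 'before break' match that is followed by 'after break'."""
    rule = re.compile(f"(?:{BEFORE_BREAK})(?={AFTER_BREAK})")
    segments = []
    start = 0
    for match in rule.finditer(text):
        segments.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


# Hypothetical SEO keyword list, all on one line in the source file.
for segment in split_segments("cheap flights, last minute deals, city breaks"):
    print(segment)
```

Each keyword ends up in its own segment, with the comma staying on the segment before the break.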

You can find some much better examples and how they are implemented in this article, another one written by Nora Diaz.

Custom Filetypes

Regular expressions play a very valuable role in preparing custom filetypes, whether this is simply handling embedded content in XML files, or in Excel, or it’s creating a complete filetype to process parts of a text based file you’ve received for translation.  In fact without any knowledge of regular expressions you’d find this very tricky indeed.  I can’t use my example of the Euro for this one so let’s just look at some simple theory using Excel… take a file with things like this for example:

You don’t really want these kinds of things to appear as translatable text as it’s too easy to make a mistake and then the target file may not be fit for purpose when it goes back translated into its original application.  So, you could use a few simple regex rules like these and you get them all:

%\d+ for a placeholder
\[\w+\] for a placeholder
<[^/>]+> for the opening tag of a tag pair
</[^>]+> for the closing tag of a tag pair

Perhaps a little mind blowing complexity here as I introduced some new concepts, but not if you understand a few simple rules.

\[ and \] are ways to find the square brackets.  Square brackets have a special meaning in Regular Expressions so if you want to find them as plain text you have to escape them first.  You do this with a backslash.  The \w is shorthand for a word character, so any letter, ideograph, digit or connector punctuation.

On the expressions for html… simple enough, but they use an interesting way to think about things.  Take the closing tag.  First of all the expression looks for a less than symbol followed by a forward slash.  Once it finds a pattern starting like this it looks for what’s inside the square brackets.  Remember I just mentioned that square brackets have a special meaning if they are not escaped, and in this case they are not escaped.  They contain a caret symbol and a greater than symbol.  As the first character inside square brackets the caret symbol means match any character apart from the ones that follow, and here that is the greater than symbol.  This is then followed by a plus outside the brackets which we know means keep looking for what you just matched (anything apart from the greater than symbol), and the match stops when it reaches a greater than symbol as this is the next thing in the pattern.  The opening tag works the same way, except that the brackets also exclude the forward slash so that it can’t match a closing tag.

Probably good to review this article if I lost you a little here as it explains in more detail how these things work, and if you have any more questions post them in the comments or ask in the SDL Community where there is a special forum dedicated just for asking questions on the use of regular expressions in Studio.
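A quick way to convince yourself that the four expressions above catch what they should is to run them over a line of text.  Here is a small sketch in Python, with a sample string I invented for the purpose (it is not taken from the Excel file), and with labels that are just my own names for the rules:

```python
import re

# The four expressions from the list above, with labels made up for this sketch.
RULES = {
    "placeholder (percent)": r"%\d+",
    "placeholder (brackets)": r"\[\w+\]",
    "opening tag": r"<[^/>]+>",
    "closing tag": r"</[^>]+>",
}


def find_protected(text):
    """Return (label, matched text) pairs in the order they appear in the text."""
    found = []
    for label, pattern in RULES.items():
        for match in re.finditer(pattern, text):
            found.append((match.start(), label, match.group(0)))
    return [(label, value) for _, label, value in sorted(found)]


# Hypothetical cell content with placeholders and a tag pair.
for label, value in find_protected("Hello %1, click <b>[SAVE]</b> now"):
    print(f"{label}: {value}")
```

Note that the opening tag expression will not match a tag that contains a forward slash anywhere inside it, so it is a deliberately simple rule for simple tags.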

Getting back to Studio, you would simply enter these expressions in here… refer to this article for the detail on how to do this:

My file now looks like this:

As opposed to this:

The SDL AppStore – CleanUp Tasks

Now if, for example, you received a Studio or WorldServer package prepared by someone else and they had not taken the time to prepare the file as we did above then you can still deal with it using an app called CleanUp Tasks that is freely available on the SDL AppStore.  This clever application can do a lot of things and it makes use of regular expressions.  So by using these rules in the application you can prepare your files for translation yourself and save your project manager the hassle of trying to find out where some text that should have been protected in the first place was incorrectly transposed:

There are a few things you need to know about how to use the application which I don’t intend to get into here, but if you review the articles you’ll see just how much you can achieve with a little knowledge of regular expressions.

Translation Memory Find & Replace text

So far this has all been about what you can do with the translatable text in the Studio Editor so I thought I’d add a little something on where you can use regular expressions in Studio when working on a Translation Memory.  Unfortunately regular expressions are not supported in the simple search.  You can use wildcards though, as in this example where I searched for any segments starting with the letters wiki followed by sources somewhere in the segment:

There are only two types of wildcards you can use, described in the help documentation as follows:

  • Use a question mark ? wildcard to match a single character at the current position.
  • Use an asterisk * wildcard to match zero or more characters at the current position.
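If you already know a little regex, the two wildcards are easy to picture as their regex equivalents: the question mark behaves like a dot and the asterisk like a dot followed by a star.  The sketch below, in Python, shows that translation.  It is only my way of explaining the wildcards and says nothing about how Studio implements the search; the function name and the sample segment are invented.

```python
import re


def wildcard_to_regex(pattern):
    """Translate ? and * wildcards into the equivalent regular expression."""
    parts = []
    for character in pattern:
        if character == "?":
            parts.append(".")    # exactly one character
        elif character == "*":
            parts.append(".*")   # zero or more characters
        else:
            parts.append(re.escape(character))  # everything else is literal
    return "".join(parts)


regex = wildcard_to_regex("wiki*sources")
print(regex)
# Hypothetical segment text to search in.
print(bool(re.search(regex, "wikipedia lists its sources")))
```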

So this is very basic and it would be great to have full regular expression support in here some time in the future.  But there is one fairly hidden place that I think only the most experienced users ever get to and that’s in the Find and Replace Text as part of the Batch Edit features.  So fairly well hidden!

Here I used the same Euro example we had at the start and used the very same expressions to correct the content of the Translation Memory in the target segments where the incorrect transposition of the Euro symbol was used.  This is actually a very powerful feature because we often see people asking how to clean up their Translation Memories by replacing incorrect content.  A word of caution though… Translation Memories can be very large and contain a lifetime of translation work.  This feature has no preview and no undo… so BACKUP your Translation Memories before you start to engage in anything like this.  This app may be interesting for you in that regard!

Back to the crossword!

If you’ve read this far, and have been bitten by the regex crossword bug then you’ll probably see that there’s a stats section showing you how many users are signed up, how many puzzles have been solved by users and how many user puzzles have been created… these are the stats at the time of writing:

This is where it gets dangerously competitive because we’re all trying to improve our score!  I’m not going to share who’s who but if you like a little competition, improved knowledge of regular expressions and a lot of sleepless nights then I’m happy to share my own progress for fun… this is me!  Post your progress in the comments if you’re up for a challenge!

One last point I’d make since I’m sure someone will pick this up is that the Regex Crossword uses the JavaScript Regex engine whereas Studio uses .NET.  There are differences, but having completed almost 300 of these regex crosswords I think I can safely say that the differences are minimal and don’t take away from the ability to understand the constructs and how they are applied.  Still a great way to learn!

DOGS and CATS...

When I first started adding articles about how to use regular expressions I thought I’d only write three… but I had an interesting question from one of our resellers, Agenor (actually Agenor always asks me the hardest questions!), about how to use the display filter to find segments that contain one word, but not another.  It was tricky, but once you have it you can use the expression all the time.  I have a collection of such things from when people ask me, so I thought I’d share how this problem was solved and also post a list of some of the useful regular expressions I have saved for the display filter in Studio 2011.

The final article (in this introductory series anyway) on regular expressions in Studio is looking at how to use search and replace in Studio.  This capability, to use regex to replace as well as search, will only be possible with the update release of SDL Trados Studio 2011 SP2 and later and it’s a very welcome addition to the toolset provided within Studio.

In Regular Expressions – Part 1 I wrote a summary of where regular expressions could be used in SDL Trados Studio, and I covered a couple of examples.  I also referred to RegexBuddy quite a lot as this is a really useful tool in helping you write and understand regular expressions.  But in case learning another application is something you don’t want to do I thought it would be handy to go through what I think are the most useful applications of regular expressions for every day use in SDL Trados Studio, and also share a few tips on how to use Studio to verify the expressions are finding what you need as well as introduce a little “economy of accuracy“.

Regular Expressions, often referred to as Regex, are something that come up again and again in forums, roadshows and the occasional questions.  So I thought it might be useful to take a better look at them and how they can be useful for translators.  To begin with I’m republishing a blog article I wrote a year or so ago on a different site so I can build on this theme in one location.
