
Tag Archives: regex

I’m pretty sure that when we started to build the new Customer Experience Team in Cluj last year that there was nothing in the job description about being competitive… but wow, they are!!!  I’d be lying if I said I wasn’t competitive, because I know I am, but it’s been a long time since I’ve had these kinds of feelings that keep me up at night.

To some extent I think the training requirements at SDL are the perfect fuel for this type of environment and I haven’t made up my mind yet whether it’s healthy or not.  But in their roles the team speak with customers through the online chat, in the community, via email… basically anywhere anyone comes in with a question because they don’t have a support contract or an account manager to ask and they didn’t know about the SDL Community which is of course the best place to go for help.  To be able to answer the variety of technical questions we see, all the team have either completed or are working through the various SDL Certifications available at a rate of knots and are learning more about the sort of problems faced by translators and project managers just by having to help people every day.  They are doing a fantastic job!

So where am I going with this introduction?  Well, sometimes we see translators asking questions about how to use regular expressions, so I spent a little time introducing them to the team.  I think a small amount of knowledge about regular expressions allows you to reap huge benefits in many places in our products, as well as in other applications you probably use, like Notepad++ or Microsoft Word for example.  I’ve written about regular expressions in the past and have a few articles you can find using this link, or this one on Word.  But to really make it easy to get started, and fun, I introduced the team to the Regex Crossword which provides some basic tutorials and a fun way to start getting used to using regular expressions.  In no time at all you’ll know enough to do some really powerful things in Studio to get a little competitive edge and this is really what I want to talk about… where in the products can you use regular expressions and why?

The article itself is quite long, but I hope it’s useful and most of all, if you have not ventured into the world of regular expressions, I hope it encourages you to have a go and learn just a little.  I promise you that with just a little basic knowledge you can achieve a lot without breaking anything using the search and filtering capabilities.  I also promise you that you can break a lot using the replace capabilities… so take care with anything that replaces and make sure you have a backup first!!

Using regular expressions in Studio

Search & Replace in the Studio Editor

Let’s say you translated a file containing lots of numbers using the Euro symbol and when you’d finished you realised that the translation should be written as 12,50 € as opposed to € 12,50. Do you translate them all again?  Of course not, you just activate the use of regular expressions in the Search & Replace dialogue when the file is open for translation and search for something like this:

€\s(\d+,\d{2})

\s means a single whitespace character, such as a space
\d means a single digit from 0-9
+ means find whatever you were looking for at least once, but keep looking until you find no more
{2} means find exactly 2 of whatever you were looking for
() putting round brackets around the expression means remember whatever you found inside these brackets

So fairly simple stuff, but very powerful when used in the right context.

Then replace whatever you found with this:

$1 €

$1 is the expression you use to recall whatever you remembered inside the brackets; it’s called a back reference
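If you want to try this search and replace pair outside Studio first, here is a minimal sketch in Python.  It is only an illustration and not how Studio does it internally: Studio uses the .NET regex engine, and Python writes the back reference as \1 where Studio uses $1.  The function name and the sample sentence are made up for the example.

```python
import re

# The search pattern from above: Euro sign, one whitespace character,
# then a remembered group of digits, a comma and exactly two digits.
SEARCH = re.compile(r"€\s(\d+,\d{2})")

# Python spells the back reference \1; Studio (.NET) spells it $1.
REPLACE = r"\1 €"


def move_euro_after_amount(text: str) -> str:
    """Rewrite every '€ 12,50' style amount as '12,50 €'."""
    return SEARCH.sub(REPLACE, text)


sample = "Der Preis beträgt € 12,50 statt € 120,00."
print(move_euro_after_amount(sample))
```

Text that doesn’t match the pattern is left exactly as it was, which is why it’s worth testing an expression on a small sample before replacing across a whole file.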

You can activate the Search & Replace with a keyboard shortcut, Ctrl+H, and set up your search & replace like this:

I have written about the Search & Replace capabilities using Regular Expressions in the past, so maybe this is worth a read too.

Search & Replace using Apps

The same principle can be used in many applications from the SDL AppStore.  For example, if you had the same situation and wanted to tackle this slightly differently with two processes not possible out of the box in Studio:

  1. Change the numbers in the source with a Search & Replace
  2. Search & Replace across 500 files in one go without opening them

A useful application for this is the SDLXLIFF Toolkit.  Here you can drag and drop the SDLXLIFF files, or just the SDLPROJ file to achieve the same thing, and apply the Search & Replace like this:

The article referenced will tell you more about what’s possible with this app but the ability to make changes in the source, across all your files at once, and see the effect of the changes before you actually apply them are great features.

Quality Assurance

A feature that isn’t used often enough, possibly because of the perceived lack of capability, or maybe complexity, is the ability to create your own QA rules using regular expressions.  I’ve touched on QA before, and I focused on the use of regular expressions, so that’s perhaps worth a read.  But just on this simple problem of the Euro, you could easily use this QA feature to find any translations not correctly transposed in the first place.

Exactly the same rule just looking in the target segment for any matching pattern, and if one is found you’ll know about it immediately as you work:

This is obviously a simple example, but the more knowledge you have about how regular expressions work the more complex scenarios you can handle using the built in QA checks in Studio.
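To make the idea of a regex QA rule concrete, here is a rough Python sketch of the same check: look in each target segment for the pattern and report where it turns up.  The function name, the segment list and the report format are all assumptions for illustration; Studio’s own QA checker is configured in its interface and will report things differently.

```python
import re

# Same expression as before: an amount still written as '€ 12,50'.
WRONG_FORMAT = re.compile(r"€\s(\d+,\d{2})")


def qa_check(target_segments):
    """Return (segment number, offending text) for every match found."""
    issues = []
    for number, segment in enumerate(target_segments, start=1):
        for match in WRONG_FORMAT.finditer(segment):
            issues.append((number, match.group(0)))
    return issues


targets = ["Preis: 12,50 €", "Preis: € 9,99", "Kein Betrag hier"]
for number, text in qa_check(targets):
    print(f"Segment {number}: found '{text}'")
```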

Filtering in the display filters

There are actually three useful display filters available for Studio (two out of the box and one in the SDL AppStore) and they can all use regular expressions:

  1. The Display Filter in the Review tab
  2. The Advanced Display Filter in the View tab
  3. The Community Advanced Display Filter from the SDL AppStore

All of them give you the ability to select only certain segments at a time and do something with them.  For example you might have segments containing only numbers and these could be handled really quickly using any of these tools by selecting them, copying source to target, manipulating them with a regex search & replace perhaps and then changing the status to translated and locking them.  For some files the ability to run a process like this can be a game changer because you no longer have to be focused on the mind-numbingly boring task of making sure that each number is being handled correctly by Studio as you translate.

If we come back to our simple Euro example where we happen to have sentences that contain only a number like this then we can’t use the built in Number Only option in the display filter in the Review tab because of the Euro symbol, but we could use a simple regular expression.  In fact we can use the same expression we have been using all along with the addition of a couple of anchors.

^€\s(\d+,\d{2})$

The caret, or circumflex, symbol at the start of this expression means only find matches where the first character in the string is a Euro.  The dollar symbol at the end of the expression means only find matches that don’t have any more text after the numbers, so they end at this point.  Basically this forces segments that are our Euro values on their own to be displayed.  I can put this expression in here as this filter assumes regex by default:
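The effect of the two anchors is easy to see in a small Python sketch.  The list of segments is invented for the example, and this only imitates what a display filter does; it is not the filter itself.

```python
import re

# The anchored expression: the whole segment must be a Euro amount.
EURO_ONLY = re.compile(r"^€\s(\d+,\d{2})$")

segments = [
    "€ 12,50",
    "The price is € 12,50",  # text before the amount, so ^ fails
    "€ 7,00 per unit",       # text after the amount, so $ fails
    "€ 3,99",
]

# Keep only the segments the anchored expression matches.
shown = [s for s in segments if EURO_ONLY.search(s)]
print(shown)
```

Without the anchors all four segments would be displayed, because each of them contains a matching amount somewhere.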

The two Advanced Display Filters will get you the very same results used like this… and remembering to check the Regular Expression check box or the search will be handled as plain text:

The use of regular expressions to filter segments is very powerful, and although I always stress the idea of keeping it simple, there are times when it can be quite tricky to filter on what you actually need.  Sometimes you can find what you don’t want more easily, and this is where the Community Advanced Display Filter is brilliant because of this option alone:

So you search for what you don’t want and then just click on Reverse Filter to get what you do.  That is easily my favourite feature in the Community Advanced Display Filter… so obvious and yet so powerful!
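The logic behind a reverse filter is nothing more than flipping the result of the match, as this small Python sketch shows.  The function and its reverse parameter are hypothetical names for the illustration, not part of the app.

```python
import re


def display_filter(segments, pattern, reverse=False):
    """Keep matching segments, or the non-matching ones if reverse is set."""
    regex = re.compile(pattern)
    return [s for s in segments if bool(regex.search(s)) != reverse]


segments = ["Chapter 1", "Introduction", "See page 42", "Thanks"]

# Finding "no digit anywhere" directly is awkward; finding a digit is easy.
print(display_filter(segments, r"\d"))                # what I don't want
print(display_filter(segments, r"\d", reverse=True))  # what I do want
```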

Regex AutoSuggest Provider

I can’t talk about the use of regular expressions in Studio without covering off the superb Regex AutoSuggest Provider.  Nora Diaz wrote a nice article about this so it’s worth referring to that for more details since the explanations are really based around why this is a useful tool for translators.  But in a nutshell the application allows you to transpose the way text is written as you type and insert it through AutoSuggest.

So let’s go back to our simple Euro example where we can add this rule in the Regex AutoSuggest Provider:

Exactly the same expressions we have been using and in effect it’s like a search & replace on the fly.  So as soon as I start to type the matching replace pattern, starting with a number, I see this where I can simply hit the return key and the number will be inserted exactly as I need it:

This tool actually enhances the concept of regex in Studio quite a bit through its use of variables that allow you to handle lists of search & replace as you type with a single search and replace pattern… a very smart application.

Segmentation Rules

This is all about preparing a file for translation before you start to translate so you can avoid the need for having to split (usually split… could be merge too) sentences manually as you go.  For example, if you received a file for translation that contained lists of SEO (Search Engine Optimization) content then you might find they are unique words and phrases all separated by a comma depending on how the lists were generated.  This would be tricky to translate unless you split the sentences to ensure that each keyword was on a separate line… well not tricky to translate, but long meaningless sentences and very little reuse of the translated material afterwards through your Translation Memory.  If you create a segmentation rule on your Translation Memory then a file containing this kind of content could be split before you even start:

Now this is a very simple example which can even be created in the Studio interface without knowing any regular expressions, but it’s the concept I want to get across.  Before the break on the left I used one expression:

.[,]+

Match any character (the dot) followed by one or more commas (the [,]+) and then break.  After the break there can be anything (the dot again).  The concept is that if you have a pattern in the text which can be used to identify where to split the segments, then splitting them as you prepare the files for translation is the best way, as you’ll have no splitting left to do in Studio as you work.  So that simple regular expression rule can have this sort of effect when you open the file for translation:
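As a rough way to picture what a before break / after break rule does, here is a Python sketch that breaks wherever the before-break expression has just matched and at least one more character follows.  This is only an imitation under my own assumptions (the function name is made up, and trimming the spaces around each segment is my choice); Studio’s segmentation engine has more going on than this.

```python
import re

BEFORE_BREAK = re.compile(r".[,]+")  # any character, then one or more commas


def split_segments(text):
    """Break after each BEFORE_BREAK match that has more text following it."""
    segments, start = [], 0
    for match in BEFORE_BREAK.finditer(text):
        end = match.end()
        # The after-break expression is a single dot, so something must follow.
        if end < len(text):
            segments.append(text[start:end].strip())
            start = end
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


print(split_segments("red shoes, blue shoes, green hats"))
```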

You can find some much better examples and how they are implemented in this article, another one written by Nora Diaz.

Custom Filetypes

Regular expressions play a very valuable role in preparing custom filetypes, whether this is simply handling embedded content in XML files, or in Excel, or it’s creating a complete filetype to process parts of a text based file you’ve received for translation.  In fact without any knowledge of regular expressions you’d find this very tricky indeed.  I can’t use my example of the Euro for this one so let’s just look at some simple theory using Excel… take a file with things like this for example:

You don’t really want these kinds of things to appear as translatable text as it’s too easy to make a mistake and then the target file may not be fit for purpose when it goes back translated into its original application.  So, you could use a few simple regex rules like these and you get them all:

%\d+ for a placeholder
\[\w+\] for a placeholder
<[^/>]+> for the opening tag of a tag pair
</[^>]+> for the closing tag of a tag pair

Perhaps a little mind blowing complexity here as I introduced some new concepts, but not if you understand a few simple rules.

\[ and \] are ways to find the square brackets.  Square brackets have a special meaning in Regular Expressions so if you want to find them as plain text you have to escape them first.  You do this with a backslash.  The \w is shorthand for a word character, so any letter, ideograph, digit or connector punctuation.

On the expressions for HTML… simple enough, but they use an interesting way to think about things.  First of all the expression looks for a less than symbol.  Once it finds a pattern starting with this symbol it looks for what’s inside the square brackets.  Remember I just mentioned that square brackets have a special meaning if they are not escaped, and in this case they are not escaped.  They contain a caret symbol and a greater than symbol (the opening tag expression also contains a forward slash, so it won’t match a closing tag).  Inside square brackets the caret symbol means match any character apart from the ones listed after it, and here that includes the greater than symbol.  This is then followed by a plus outside the brackets which we know means keep looking for what you just matched (anything apart from the greater than symbol) and then stop when you match a greater than symbol as this is the next thing in the pattern.  Probably good to review this article if I lost you a little here as it explains in more detail how these things work, and if you have any more questions post them in the comments or ask in the SDL Community where there is a special forum dedicated just to asking questions on the use of regular expressions in Studio.
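If you’d like to see what these expressions pick up before committing them to a filetype, a short Python sketch like this one can help.  The sample string is invented, and combining the rules into one expression is just a convenience for the test; in Studio each rule is entered separately.

```python
import re

# The rules listed above, unchanged.
PATTERNS = [
    r"%\d+",      # placeholder such as %1
    r"\[\w+\]",   # placeholder such as [FILENAME]
    r"<[^/>]+>",  # opening tag of a tag pair
    r"</[^>]+>",  # closing tag of a tag pair
]
PROTECTED = re.compile("|".join(PATTERNS))


def find_protected(text):
    """List everything the rules would turn into tags, in reading order."""
    return [m.group(0) for m in PROTECTED.finditer(text)]


sample = "Click <b>%1</b> to open [FILENAME]"
print(find_protected(sample))
```

Everything in the returned list would be protected, leaving only "Click", "to open" and the spaces as editable text.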

Getting back to Studio, you would simply enter these expressions in here… refer to this article for the detail on how to do this:

My file now looks like this:

As opposed to this:

The SDL AppStore – CleanUp Tasks

Now if, for example, you received a Studio or WorldServer package prepared by someone else and they had not taken the time to prepare the file as we did above then you can still deal with it using an app called CleanUp Tasks that is freely available on the SDL AppStore.  This clever application can do a lot of things and it makes use of regular expressions.  So by using these rules in the application you can prepare your files for translation yourself and save your project manager the hassle of trying to find out where some text that should have been protected in the first place was incorrectly transposed:

There are a few things you need to know about how to use the application which I don’t intend to get into here, but if you review the articles you’ll see just how much you can achieve with a little knowledge of regular expressions.

Translation Memory Find & Replace text

So far this has all been about what you can do with the translatable text in the Studio Editor so I thought I’d add a little something on where you can use regular expressions in Studio when working on a Translation Memory.  Unfortunately regular expressions are not supported in the simple search.  You can use wildcards in here where I searched for any segments starting with the letters wiki followed by sources somewhere in the segment:

There are only two types of wildcards you can use, described in the help documentation as follows:

  • Use a question mark ? wildcard to match a single character at the current position.
  • Use an asterisk * wildcard to match zero or more characters at the current position.
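To see how limited these two wildcards are compared with regular expressions, here is a Python sketch that translates a wildcard search into the equivalent regex.  The function name is hypothetical, and matching against the whole segment is my assumption for the example; I’m not claiming this is how the Translation Memory search is implemented.

```python
import re


def wildcard_to_regex(wildcard):
    """Translate ? and * into their regex equivalents, escaping the rest."""
    parts = []
    for char in wildcard:
        if char == "?":
            parts.append(".")    # exactly one character
        elif char == "*":
            parts.append(".*")   # zero or more characters
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts))


search = wildcard_to_regex("wiki*sources*")
print(bool(search.fullmatch("wikipedia lists its sources here")))
print(bool(search.fullmatch("see the wiki for sources")))
```

There is no way to say "a digit" or "exactly two of these" with wildcards alone, which is what the regular expressions in the rest of this article are doing.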

So this is very basic and it would be great to have full regular expression support in here some time in the future.  But there is one fairly hidden place that I think only the most experienced users ever get to and that’s in the Find and Replace Text as part of the Batch Edit features.  So fairly well hidden!

Here I used the same Euro example we had at the start and used the very same expressions to correct the content of the Translation Memory in the target segments where the incorrect transposition of the Euro symbol was used.  This is actually a very powerful feature because we often see people asking how to clean up their Translation Memories by replacing incorrect content.  A word of caution though… Translation Memories can be very large and contain a lifetime of translation work.  This feature has no preview and no undo… so BACKUP your Translation Memories before you start to engage in anything like this.  This app may be interesting for you in that regard!

Back to the crossword!

If you’ve read this far, and have been bitten by the regex crossword bug then you’ll probably see that there’s a stats section showing you how many users are signed up, how many puzzles have been solved by users and how many user puzzles have been created… these are the stats at the time of writing:

This is where it gets dangerously competitive because we’re all trying to improve our score!  I’m not going to share who’s who but if you like a little competition, improved knowledge of regular expressions and a lot of sleepless nights then I’m happy to share my own progress for fun… this is me!  Post your progress in the comments if you’re up for a challenge!

One last point I’d make, since I’m sure someone will pick this up, is that the Regex Crossword uses the JavaScript regex engine whereas Studio uses .NET.  There are differences, but having completed almost 300 of these regex crosswords I think I can safely say that the differences are minimal and don’t take away from the ability to understand the constructs and how they are applied.  Still a great way to learn!

Every now and then I see an application and I think… this one is going to be a game changer for Studio users.  There have been a few, but the top two for me have been the “SDLXLIFF to Legacy Converter” which really helped users working with mixed workflows between the old Trados tools and the new Studio 2009, and the “Glossary Converter” which has totally changed the way translators view working with terminology and in my opinion has also been responsible for some of the improvements we see in the Studio/MultiTerm products today.  There are many more, and AnyTM is a contender, but if I were to only pick my top three where I instantly thought WOW!, then the first two would feature.  So what about the third?  You could say I have the benefit of hindsight with the first two although I’m not joking about my reaction when I first saw them, but the third is brand new and I’m already predicting success!

Read More

It’s been a while since I wrote anything about the SDLXLIFF Toolkit… in fact I haven’t done since it was first released with the 2014 version of Studio.  Now that we have added a few new things such as SDLPLUGINS so that apps are better integrated and can be more easily distributed with Studio we have launched a new version of the toolkit for Studio 2017.  What’s new?  To be honest not a lot, but there are a couple of things that I think warrant this visit.

First of all, the app is now a plugin and this means it loads faster, is always available and there are a few tricks to being able to get the most from this.  Secondly, there are a few fixes to the search & replace features that make it possible to complete tasks that Studio will fail with and to do this the API team completely rebuilt the regex engine.  So whilst you won’t see too many changes, there are a few under the hood.

The best way to illustrate this is to show you so I have created a short video below where I have tried to explain how best to use the toolkit now it’s a plugin and not a standalone application, and I used the problems described below to demonstrate how it works.  If you want to know what else it can do I have reproduced part of the original guide below the video as that seems to have been lost over the years.  This might be helpful for a few of the more obscure features you may not have realised were possible.

Read More

Wow… how time flies!  Over three years ago I wrote an article called AutoCorrect… for everything! which explained how to use AutoHotkey so you had a similar functionality to Microsoft Word for autocorrect, except it worked in all your Windows applications.  This was, and still is, pretty cool I think and I still use AutoHotkey today for many things, and not just autocorrect.  Since writing that article we released Studio 2015, and in fact Studio 2017 is just around the corner, so it was a while back and some things have moved on.  For example, Studio 2015 introduced an autocorrect feature into Studio which meant things should be easier for all Studio users, especially if they had not come across AutoHotkey before.

Read More

Drink deep, or taste not the Pierian Spring:
There shallow Draughts intoxicate the Brain,
And drinking largely sobers us again.

I’m quoting Alexander Pope in 1709, rightly or wrongly, for hitting the nail on the head when it comes to the truly intoxicating mix of language and technology.  A little knowledge is indeed a dangerous thing and it’s something I know I’ve been guilty of all my life… I learn a little something new and now I’m an expert.  That is of course until I learn a bit more, and then a little more after that, and before I know it I realise I know nothing at all!  Translation technology is great for dropping us all into this trap… Trados user since Trados 5, translator for over 20 years… can handle any type of file.  Falling into this trap is pretty easy in fact, especially when the tools available for translation today take a lot of the effort out of the tasks at hand.  But not everything is what it seems and sometimes it takes a mistake or three to sober us up again!  There’s a reason why well organised and successful translation companies, dealing in all kinds of content, have Project Managers, Translators and Localization Engineers in their midst.

Read More

Update Sept 2016: You can find an excellent filetype plugin for JSON files on the SDL AppStore if you don’t want to tackle this yourself.

The JSON files… not really related to Jason Voorhees of course, but for some users who have received these file types for translation the problem of how to handle them and extract the appropriate text may well seem like an episode of Friday the 13th!  I’ve seen a few threads in the last couple of weeks sharing various methods for handling these files, ranging from opening them in MS Word and applying a hidden style to the parts you don’t want, to asking vendors to create variations on JavaScript filetypes.  But I think Studio offers a much simpler mechanism for handling them out of the box.

So what are these file types and how can you handle them with Studio 2014, or even 2009/2011?  In this article I’m going to look at the regex filetype as this is very well suited to files like this, but before we get into that detail let’s take a look at what they are. Read More

The AutoSuggest feature in Studio has been around since the launch of Studio 2009 and based on the questions I see from time to time I think it’s a feature that could use a little explanation on what it’s all about.  In simple terms it’s a mechanism for prompting you as you type with suggested target text that is based on the source text of the document you are translating.  So sometimes it might be a translation of some or all of the text in the source segment, and sometimes it might be providing an easy way to replicate the source text into the target.  This is done by you entering a character via the keyboard and then Studio suggests suitable text that can be applied with a single keystroke.  In terms of productivity this is a great feature and given how many other translation tools have copied this in one form or another I think it’s clear it really works too!

AutoSuggest comes from a number of different sources, some out of the box with every version of the product, and some requiring a specific license.  The ability to create resources for AutoSuggest is also controlled by license for some things, but not for all.  When you purchase Studio, any version at all, you have the ability to use the AutoSuggest resources out of the box from three places: Read More
