The versatile regex based text filter in Trados Studio…

After attending the xl8cluj conference in Romania a few weeks ago, which was an excellent and very technical conference for translators, I thought it was about time I wrote an article about the things you can do with the Regular Expression Delimited Text filter.  It is so useful for solving all kinds of tasks related to text-based files that don’t fit any of the out-of-the-box formats available in the product.  Software string files and CSV files are common examples where understanding how to work with this customisable file type can yield many benefits.  So this article offers some food for thought and a few things that might be helpful to you in the future.  It’s also pretty long (I’m not kidding!), so maybe grab a cup of coffee before you start to go through it!

CSV FILES

But Studio can handle a CSV file out of the box!  Well that’s true, but only if you have one column containing the source and one containing the target.  If you have a monolingual file containing text for translation that just happens to be separated by commas then the out of the box file type isn’t helpful.  I actually discussed this at length with a few users in Cluj as there are obviously some good workarounds for this and I imagine this is what most people do already when handling a file like this:

  • convert to Excel
  • import to Excel

Both of these are good workarounds, but the first one is a very tedious process if you have hundreds or even thousands of these files to contend with, and the second one would fail if the CSV files were formatted differently because you couldn’t easily get the target CSV files back out later.

Now having said this you only have to do a simple search for “batch convert csv to excel” in Google and you’ll get loads of free options to make this easy for you.  But if I do that I can’t show you some really useful features of the Regular Expression Delimited Text filter which could be useful for other tasks… so instead let’s pretend I didn’t say that!

Step 1: Create the filetype

To begin with I go to File -> Options -> File Types and click on New… then select the Regular Expression Delimited Text.

This opens up the File Type Information pane and I complete these four fields (not all mandatory but they are useful):

  1. File type name: this provides a unique name to the file type.
  2. File type identifier: (not mandatory) this allows me to be sure I have used the correct file type when preparing projects as it shows up in the Files View after preparing my project and also in the orange tab at the top of each open file in Studio when I use TagID mode.
  3. File dialog wildcard expression: you need to make sure this says *.csv if you want it to be used to open a CSV file.
  4. Description: (not mandatory) this is just useful, especially if you create a lot of custom file types or share them with others, as you can make a note of what the file type does.

Then you click on Finish and you should have your new file type ready to go… now that was easy!!

Step 2: Previewing your genius

I wanted to add this in because this feature in Studio is a fantastic time saver when you are working on your ingenious creation.  You just select one of your CSV files and click on Preview after each change to the settings in your file type:

  1. this is your new file type that should now be visible in your list.
  2. this is the Preview feature.  Just Browse… for your test file and then click on Preview.

When I do this I can see that I have opened up all the content of my file for translation, but it’s not segmented as I’d like on the comma:

So the next step has to be to create the rules I’ll need to segment the file.

Step 3: Segmenting the text

Normally, when you need to segment your text you’d think about creating a custom segmentation rule in your TM since this is what drives your segmentation.  You could do this of course, but file types also assist in the segmentation of your file and in this case I think it’s easier to manage in this way.  If nothing else it means you don’t have to use a different TM whenever you’re handling CSV files.

So, how do we go about this?  Well, the way I’m going to tackle it is by making the comma a non-translatable tag and setting it as external.  This is pretty simple and I just do this:

  1. Open the new file type you created and click on Inline tags.
  2. Add a new rule.
  3. Specify that the Rule Type is a Placeholder, and use a comma as the Opening rule.
  4. Click on Advanced…
  5. Then set the rule to Exclude

That was pretty simple too… now we can preview the test file again by clicking on Preview (see how fast this test is!):

In general that seems pretty good… until we scroll down a little and now I see that segments 54 to 57 should actually be in one segment, and segments 59 and 60 should also be in one segment.  To understand why, and to illustrate the problem we have to solve, we need to look at the source file:

006,William-Adolphe Bouguereau,French,(1825-1905),"A Girl in Peasant Costume, Seated, Arms Folded, Holding a Ball of Wool and Knitting Needles in her Right Hand",1875,"1,305 €"

On inspecting the CSV file we can see that some fields are enclosed in double quotes.  The quotes tell a parser that can read CSV files not to separate on commas when they fall within them.  So simply using a comma as the rule to segment on is not enough for my files.  I need to be a little smarter.  To do this I need to create a regular expression that will only find commas where I need to segment.  So, this is what I did:

[^"] Match anything apart from a quote
* Keep matching anything apart from a quote
" until you get to a quote
\B but only where the quote isn’t followed by a word boundary (this is the case for a closing quote, because the comma or end of line that follows it is not a word character, i.e. not in \w)

This gives me this:

[^"]*"\B

Now, I want to find a comma that doesn’t fall within this search pattern.  So to do this I need to enclose this within what’s referred to as a negative lookahead:

(?![^"]*"\B)

A negative lookahead is just an assertion, it doesn’t actually match anything.  But if I add my comma because this is what I want to find, then I can now find commas but only when they’re not followed by what’s in the lookahead:

,(?![^"]*"\B)

Apologies if that was a little hard to follow… it’s a good example of why it’s important to learn a little about regular expressions.  If this is all new to you I’d recommend you start now as there are so many applications in a translation tool for using them and you can get a lot of benefits from this.  But I digress… back to our file type.  I can now replace the comma I used previously with my new expression like this:

And this time when I preview the file I see something like this:

That’s much better… only spoiled by the double quotes that have been included as translatable text.  But that’s easily solved by adding one more rule with a single double quote as a non-translatable placeable, and setting this as external so it’s removed from view:

And that’s it… for the sample files I used this does the job nicely and I can handle as many as I like without having to do any conversions at all.  If you want to work through this example, here’s a test file you can copy/paste to create a CSV like this one:

SKU,Name,Nationality,Lived,Artwork,Year,Est. Value
001,Leon Bakst,Russian,(1866-1924),Portrait of Virginia Zucchi,1917,"12,250 €"
002,Sir Max Beerbohm,British,(1872-1956),The Encaenia of 1908,1908,"17,100 €"
003,Ivan Yakovlevich Bilibin,Russian,(1876-1942),Design for the Costume of Babarikha (the Matchmaker) in Rimsky-Korsakov's Opera 'Tsar Sultan,1928,"12,000 €"
004,Richard Parkes Bonington,British,(1802-1828),Shipping Off the Kent Coast,1825,"7,500 €"
005,François Bonvin,French,(1817-1887),"A Seated Woman, Sewing by a Table",1848,"10,250 €"
006,William-Adolphe Bouguereau,French,(1825-1905),"A Girl in Peasant Costume, Seated, Arms Folded, Holding a Ball of Wool and Knitting Needles in her Right Hand",1875,"1,305 €"
007,Ford Madox Brown,British,(1821-1893),Study for a Greyhound,1850,"25,950 €"
008,Alexander Pavlovich Bryulov,Russian,(1798-1877),"Portrait of Marie-Amélie, Queen of the French",1860,"8,400 €"
009,Paul Cézanne,French,(1839-1906),"Studies of a Child's Head, a Woman's Head, a Spoon, and a Longcase Clock",1872,"32,350 €"
010,Jean-Baptiste Camille Corot,French,(1796-1875),Civita Castellana: A Woodland Stream in a Rocky Gully,1826,"12,750 €"
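
If you’d like to sanity-check the separator expression outside Studio first, here’s a minimal sketch in Python.  It assumes Python’s re module treats the negative lookahead and \B the same way as the regex flavour Studio uses (I’d expect so for these constructs, but treat that as an assumption), and the names SEPARATOR, line, fields and reference are just mine for illustration.  It splits row 006 from the test file and compares the result with Python’s own csv parser:

```python
import csv
import re

# The separator rule from the article: a comma that is NOT followed by
# "non-quote characters, then a quote that has no word boundary after it".
SEPARATOR = re.compile(r',(?![^"]*"\B)')

line = ('006,William-Adolphe Bouguereau,French,(1825-1905),'
        '"A Girl in Peasant Costume, Seated, Arms Folded, Holding a Ball of '
        'Wool and Knitting Needles in her Right Hand",1875,"1,305 €"')

# Split only on the commas the rule accepts as separators.
fields = SEPARATOR.split(line)
for number, field in enumerate(fields, start=1):
    print(number, field)

# Cross-check: a real CSV parser should find the same fields
# (it strips the surrounding quotes, so we strip them for comparison too).
reference = next(csv.reader([line]))
print([f.strip('"') for f in fields] == reference)
```

One thing this makes easy to see is that the rule is a heuristic: it recognises an opening quote by the word character that follows it, so a quoted field starting with something like a bracket would probably need a smarter expression.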

Now we can take a look at some software string files that are also not handled out of the box.

SOFTWARE STRING FILES

These are file types I see coming up all the time in the forums in some form or another.  Unfortunately they are often the most inconsistent files in terms of the syntax being used, but we can work around this easily enough using our rules.  So, what do these file types look like?  Most of the time they are key-value pair files, so I’ll use these as an example and I’m pretty sure you’ll be able to adapt the rules to suit any variants to this on your own… but if you can’t you can always ask for help in the SDL Community where you’ll find plenty of help from the many smart users in there:

Apple define their strings files like this:

/* Question in confirmation panel for quitting. */
"Confirm Quit" = "Are you sure you want to quit?";

/* Message when user tries to close unsaved document */
"Close or Save" = "Save changes before closing?";

These have three components to them:

  1. a comment enclosed with the /* and */ syntax.
  2. a key enclosed in double quotes preceding the equals sign
  3. a value enclosed in double quotes after the equals sign

The ideal way to handle these files is to use SDL Passolo where the file preparation is a breeze and you can export to SDLXLIFF to translate in Studio afterwards if you prefer.  Using the DSI Viewer from the SDL AppStore means you can see the comment and the key for each value being translated as you work… very neat and simple:

But… if you’re a Studio user without access to Passolo, and you’ve been asked to handle a file like this, which we see happening all the time, then here’s a solution using the Regular Expression Delimited Text file type.

Step 1: Create the filetype

This is exactly the same as we did before, except this time you probably have to use *.strings as the File dialog wildcard expression.

Step 2: Previewing your genius

We can see here in our preview pane that the entire contents of the file are being extracted and segmented on the basis of Studio default rules:

So if we want to be able to see all of this information then we need to try and do a couple of things:

  1. extract the comment and lock it so we can still see it, but ignore it during translation
  2. segment the key-value pair so they are on separate lines
  3. lock the key so it can be seen but ignored during translation

Step 3: Extract the comment

This is pretty straightforward, we just create an Inline tag rule using a Tag pair like this:

I removed some of the steps this time on the basis you would have no problem doing this after following the more detailed steps for the CSV file type above. Hopefully this also helps to see how simple this can be for all kinds of text based filetypes.

I could set this rule as Include under the Advanced… options, which gets me this when I preview:

You can see that the comments are visible, in their own segment even if there is a period at the end of the comment, and also protected so you won’t translate them.

Important note:

However, in practice I had a small error repeating this for the other rules when I tried to lock the key in the same way I tackled the first rule above.  So instead I took a different approach and set all the rules as translatable instead (which will remove the locked status in the image above) and used the formatting feature to colour the text red… I’ll explain why I did this shortly.

Step 4: Segment the key-value pairs

To tackle this I created two new rules.  The first is a simple tag pair rule to extract the key string and also colour it red, as I eventually did for the comment:

^" Opening: Match a double quote at the start of the line
"\s Closing: Match a double quote followed by a space

That should allow me to extract the key string in the highlighted text below:

/* Question in confirmation panel for quitting. */
"Confirm Quit" = "Are you sure you want to quit?";

/* Message when user tries to close unsaved document */
"Close or Save" = "Save changes before closing?";

Then I created another tag pair rule to extract the value string which is actually the one I want to translate, but didn’t colour the text:

=\s" Opening: Match an equals sign followed by a space and a double quote
";$ Closing: Match a double quote followed by a semi-colon at the end of the line

That should allow me to extract the value string in the highlighted text below:

/* Question in confirmation panel for quitting. */
"Confirm Quit" = "Are you sure you want to quit?";

/* Message when user tries to close unsaved document */
"Close or Save" = "Save changes before closing?";

This nicely previews like this:

You can see that the text I need to translate is black, and the comment and the key string are in red, but visible to me while translating.
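
If you want to test the two tag pair rules before putting them into the file type, here’s a small sketch in Python.  It’s only an approximation of what Studio does: I’ve joined each opening and closing pattern with a lazy (.*?) to capture what sits between them, and the comment pattern is my own assumption because that rule only appears in a screenshot above.  The names KEY, VALUE and COMMENT are made up for the example:

```python
import re

sample = '''/* Question in confirmation panel for quitting. */
"Confirm Quit" = "Are you sure you want to quit?";

/* Message when user tries to close unsaved document */
"Close or Save" = "Save changes before closing?";
'''

# Opening and closing patterns from the two tag pair rules in this step.
# MULTILINE makes ^ and $ work on each line rather than on the whole file.
KEY = re.compile(r'^"(.*?)"\s', re.MULTILINE)
VALUE = re.compile(r'=\s"(.*?)";$', re.MULTILINE)

# Assumed comment rule: everything between /* and */.
COMMENT = re.compile(r'/\*(.*?)\*/')

keys = KEY.findall(sample)
values = VALUE.findall(sample)
comments = [c.strip() for c in COMMENT.findall(sample)]

print("keys:    ", keys)
print("values:  ", values)
print("comments:", comments)
```

Only the values list holds text I’d actually want to translate; the keys and comments are the parts that get coloured and locked in the next step.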

Step 5: Filter out and lock the non-translatable segments

Now, why did I colour them red apart from the obvious reason which is to be able to distinguish them from the translatable text?  Well, if I open one of these apple strings files in the Studio Editor I can now use the Community Advanced Display Filter to filter on the red coloured text, like this:

So now I’m only displaying the segments I don’t want to translate.  Next I just copy source to target, change the status to translated and lock them.  I can now clear the filter to see this:

Perfect… I now get these benefits:

  1. my analysis will only include the translatable text
  2. when I confirm a segment I will only ever move to the next segment for translation
  3. I can always read the comments
  4. I can always see the key string

… and I get one little annoyance!  The segment with text between tags (the comment) retains the red colour when I lock the segment whilst the other segment does not.  If I did this again I’d use grey as the colour as opposed to red because I find it distracting… but I’m leaving this here because you may also come across a similar problem as me.

WHAT ELSE?

Well, the apple strings file was just one typical example, so here’s a few more (just the settings and what you should get) so you have some idea of how to use the rules for these sorts of files that we do see quite often in the community forums.

Another way to handle our apple strings example

If you’re only interested in the translatable text and don’t want to see the comments or key strings at all then you can also handle this using the Document Structure node in the file type settings by telling it exactly what you want to extract in the first place:

".+=\s" Match a double quote, keep matching any character until it’s possible to match an equals sign followed by a space and a double quote
";$ Match a double quote followed by a semi-colon at the end of the line

That should allow me to extract the value string in the highlighted text below:

/* Question in confirmation panel for quitting. */
"Confirm Quit" = "Are you sure you want to quit?";

/* Message when user tries to close unsaved document */
"Close or Save" = "Save changes before closing?";

This nicely previews like this:

LNG (Language Resource Files) and PHP (Array Files)

These types of files are used by various software applications and I have seen them (rightly or wrongly) with different file type endings, so it’s important to note the ending when you create your file type as you’ll need to use this in the File dialog wildcard expression which you’ll recall from step 1 in the CSV example.  Typical examples I’ve come across are things like *.lng, *.ini, *.php, *.txt.  The main thing is that the format of the text in the file could be something like this where you’re interested in getting at the highlighted text only:

lng file example

[trPrint]
TR_About="&About..."
TR_FormCaption="Find Text..."
TR_SaveFilePositions="&Remember editing positions"

or something like this:

PHP array file

<?php
/* en.php - english language file */
$messages['hello'] = 'Hello';
$messages['signup'] = 'Sign up for free';
?>

All of these sorts of files follow the same basic principle (as far as we’re concerned for the file type creation) and can be handled easily using the Document Structure node in the file type settings as we did earlier for the simplified apple strings file type.  For the language resource file, lng, I could use something like this:

.+=" Keep matching any character until it’s possible to match an equals sign followed by a double quote
"$ Match a double quote at the end of the line

Which should get me:

Not bad… but I could improve on this and also protect the accelerator keys you can see in the text (& symbol), which will also help me with QA by flagging any of these important tags that go missing.  To do this I just add a simple placeholder rule in the Inline tags and make sure the Inline tag behaviour is set to Include:

Now I have this:

So that was simple enough… and what about the PHP array file?  A very similar task, and I could solve it with these opening and closing patterns in the Document structure node:

.+\s' Keep matching any character until it’s possible to match a space followed by a single quote
';$ Match a single quote followed by a semi-colon at the end of the line

Which should get me:

So very similar and very straightforward.
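
To show how alike these three cases really are, here’s a rough Python sketch that runs the opening and closing patterns for the apple strings, lng and PHP examples through one small helper.  The extract function is hypothetical and only mimics the idea of the Document Structure node by capturing whatever sits between the two patterns on each line; it isn’t how Studio itself is implemented:

```python
import re

def extract(opening, closing, text):
    """Return the text found between an opening and a closing pattern."""
    # Lazy capture between the two patterns; MULTILINE so $ means end of line.
    pattern = re.compile(opening + r'(.*?)' + closing, re.MULTILINE)
    return pattern.findall(text)

strings_file = '''/* Question in confirmation panel for quitting. */
"Confirm Quit" = "Are you sure you want to quit?";

/* Message when user tries to close unsaved document */
"Close or Save" = "Save changes before closing?";
'''

lng_file = '''[trPrint]
TR_About="&About..."
TR_FormCaption="Find Text..."
TR_SaveFilePositions="&Remember editing positions"
'''

php_file = """<?php
/* en.php - english language file */
$messages['hello'] = 'Hello';
$messages['signup'] = 'Sign up for free';
?>
"""

# The same opening/closing pairs described in the article.
strings_values = extract(r'".+=\s"', r'";$', strings_file)
lng_values = extract(r'.+="', r'"$', lng_file)
php_values = extract(r".+\s'", r"';$", php_file)

for name, found in [("strings", strings_values),
                    ("lng", lng_values),
                    ("php", php_values)]:
    print(name, found)
```

A quick test like this is handy when someone sends you a variant of one of these formats, because you only have to adjust the two patterns and see what comes out.  Bear in mind the samples here use plain line feeds; files with other line endings may behave differently in Python.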

Final Words

If you have a text based file type like these and after reading this post are still having problems then feel free to share a snippet and if I can do it I’ll add your example to the list.  This sort of thing comes up so often I think the more we have as examples the better.

I also had some thoughts around what’s lacking with the current features for handling these file types in Studio and if we don’t see them in the product in the future we might take a look at handling them through the SDL AppStore:

  • ability to define a pattern you can assign as a comment
  • ability to define a pattern you can assign as Document Structure Information
  • ability to define source and target patterns in case the file you have is multilingual/bilingual

If you have any other thoughts of your own feel free to add them and we can consider these as well.  In the meantime I’ve added these to the SDL Ideas site… so go and vote today!

Wot! No target!!

The origin of Chad (if you’re British) or Kilroy (if you’re American) seems largely supposition.  The most likely story I could find, or rather the one I like the most, is that it was created by the late cartoonist George Edward Chatterton ‘Chat’ in 1937 to advertise dance events at a local RAF (Royal Air Force) base.  After that Chad is remembered for bringing attention to any shortages, or shortcomings, in wartime Britain with messages like Wot! No eggs!!, and Wot! No fags!!.  It’s not used a lot these days, but for those of us aware of the symbolism it’s probably a fitting exclamation when you can’t save your target file after completing a translation in Trados Studio!  At least that would be the polite exclamation since this is one of the most frustrating scenarios you may come across!

At the start of this article I fully intended this to be a simple description of the problems around saving the target file, but like so many things I write it hasn’t turned out that way!  But I found it a useful exercise so I hope you will too.  So, let’s start simple despite that introduction because the reasons for this problem usually boil down to one or more of these three things:

  1. Not preparing the project so it’s suitable for sharing
  2. Corruption of a project file
  3. A problem with the source file or the Studio filetype

Priorities… paths… filetypes….

At the beginning of each year we probably all review our priorities for the New Year ahead so we have a well balanced start… use that gym membership properly, study for a new language, get accredited in some new skill, stop eating chocolate… although that may be going just a bit too far, everything is fine with a little moderation!  I have to admit that moderating chocolate isn’t, and may never be, one of my strong points even though it’s on my list again this year!  But the idea of looking at our priorities and setting them up appropriately is a good one so I thought I’d start off 2018 with a short article explaining why this is even important when using SDL Trados Studio, particularly because I see new users struggling with, or just not being aware of, the concepts around the prioritisation of filetypes.  If you don’t understand them then you can find code doesn’t get tagged correctly despite you setting it up, or non-translatable text is always getting extracted for translation even though you’re sure you excluded it, or even files being completely mishandled.

Double vision!!

There are well over 200 applications in the SDL AppStore and the vast majority are free.  I think many users only look at the free apps, and I couldn’t blame them for that as I sometimes do the same thing when it comes to mobile apps.  But every now and again I find something that I would have to pay for but it just looks too useful to ignore.  The same logic applies to the SDL AppStore and there are some developers creating some marvellous solutions that are not free.  So this is the first of a number of articles I’m planning to write about the paid applications, some of them costing only a few euros and others a little more. Are they worth the money?  I think the developers deserve to be paid for the effort they’ve gone to but I’ll let you be the judge of that and I’ll begin by explaining why this article is called double vision!!

From time to time I see translators asking how they can get target documents (the translated version) that are fully formatted but contain the source and the target text… so doubling up on the text that’s required.  I’ve seen all kinds of workarounds ranging from copy and paste to using an auto hotkey script that grabs the text from the source segment and pastes it into the target every time you confirm a translation. It’s a bit of an odd requirement but since we do see it, it’s good to know there is a way to handle it. But perhaps a better way to handle it now would be to use the “RyS Enhanced Target Document Generator” app from the SDL AppStore?

Iris Optical Character Recognition

I’m back on the topic of PDF support!  I have written about this a few times in the past with “I thought Studio could handle a PDF?” and “Handling PDFs… is there a best way?“, and this could give people the impression I’m a fan of translating PDF files.  But I’m not!  If I was asked to handle PDF files for translation I’d do everything I could to get hold of the original source file that was used to create the PDF because this is always going to be a better solution.  But the reality of life for many translators is that getting the original source file is not always an option.  I was fortunate enough to be able to attend the FIT Conference in Brisbane a few weeks ago and I was surprised at how many freelance translators and agencies I met dealt with large volumes of PDF files from all over the world, often coming from hospitals where the content was a mixture of typed and handwritten material, and almost always on a 24-hr turnaround.  The process of dealing with these files is really tricky and normally involves using Optical Character Recognition (OCR) software such as Abbyy Finereader to get the content into Microsoft Word and then a tidy up exercise in Word.  All of this takes so long it’s sometimes easier to just recreate the files in Word and translate them as you go!  Translate in Word…sacrilege to my ears!  But this is reality and looking at some of the examples of files I was given there are times when I think I’d even recommend working that way!

Cutie Cat?

A nice picture of a cutie cat… although I’m really looking for a cutie linguist and didn’t think it would be appropriate to share my vision for that!  More seriously the truth isn’t as risqué… I’m really after Qt Linguist.  Now maybe you come across this more often than I do so the solutions for dealing with files from the Qt product, often shared as *.TS files, may simply roll off your tongue.  I think the first time I saw them I just looked at the format with a text editor, saw they looked pretty simple and created a custom filetype to deal with them in Studio 2009.  Since that date I’ve only been asked a handful of times so I don’t think about this a lot… in fact the cutie cat would get more attention!  But in the last few weeks I’ve been asked four times by different people and I’ve seen a question on ProZ so I thought it may be worth looking a little deeper.

All that glitters is not gold…

Years ago, when I was still in the Army, there was a saying that we used to live by for routine inspections.  “If it looks right, it is right”… or perhaps more fittingly “bullshit baffles brains”.  These were really all about making sure that you knew what had to be addressed in order to satisfy an often trivial inspection, and to a large extent this approach worked as long as nobody dug a little deeper to get at the truth.  This approach is not limited to the Army however, and today it’s easy to create a polished website, make statements with plenty of smiling users, offer something for free and then share it all over social media.  But what is different today is that there is potential to reach tens of thousands of people and not all of them will dig a little deeper… so the potential for reward is high, and the potential for disappointment is similarly high.
