If there’s one thing I firmly believe, it’s that all translators should learn a little bit of regex, or regular expressions.  In fact it probably wouldn’t hurt anyone to know how to use them a little bit, simply because they are so useful for manipulating text, especially when it comes to moving text in and out of spreadsheets.  When I started to think about this article today I was thinking about how to slice up text so that it’s better segmented for translation, and about what data to use.  I settled on lists of data, as this sort of question comes up quite often in the community, and to create some sample files I used this Wikipedia page.  It’s a good list, so I copied it as plain text straight into Excel, which got me a column of fruit formatted exactly as I would like to see it if I were translating it: one fruit per segment.  But as I wanted to replicate the sort of lists we see translators getting from their customers, I copied the list into a text editor and used regex to replace the hard returns (\r\n) with a comma and a space, then broke the file up alphabetically… it took me around a minute to do.  I’m pretty sure that kind of simple manipulation would be useful for many people in all walks of life.  But I digress….

I now have three files created from my list and I’ll use these to try and explain the concepts of segmenting text in SDL Trados Studio through the use of segmentation rules in your translation memory.

  1. a comma plus space separated list
  2. a comma only separated list
  3. a tab delimited list
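
For anyone who would rather script this than use a text editor, here is a minimal Python sketch of how three files like these could be produced from a column of fruit.  The fruit names, the join_lines helper and the variable names are assumptions made up for the illustration, and the pattern \r?\n is a slightly more forgiving version of the \r\n mentioned above so that it also copes with text that only uses \n.

```python
import re

# Hypothetical sample data: one fruit per line, as pasted from a spreadsheet column
column = "Apple\r\nBanana\r\nCherry\r\n"

def join_lines(text, separator):
    """Replace every hard return with the given separator."""
    # Drop trailing line breaks first so the list does not end with a separator
    body = text.rstrip("\r\n")
    return re.sub(r"\r?\n", separator, body)

comma_space = join_lines(column, ", ")    # 1. comma plus space
comma_only = join_lines(column, ",")      # 2. comma only
tab_delimited = join_lines(column, "\t")  # 3. tab delimited

print(comma_space)
```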

In Studio without creating custom segmentation rules I’m going to see something like these:

comma plus space

comma only

tab delimited

I think these are the most common sorts of problem we see users dealing with, and hopefully they will allow me to explain the concepts you can use for segmenting on anything.  So, first of all let’s see where you create these rules.  Trados Studio segments files based on the structure of the file first and then using segmentation rules in your translation memory.  So if you have a Word file and you end a sentence with a full stop followed by a space, or you press the enter key which inserts a hard return (or a paragraph break), you will get a separate sentence (or segment) in Studio when you open the file.  Like this:

If you are working with markup files, like XML or HTML, then the segmentation is controlled by the parser rules for the separate elements.  For example, opening an XML file with the parser rules set incorrectly could lead to something like this where the file is segmented based on full stops at the end of each sentence, but also where the inline elements (<nice>, <info> and <not>) have been incorrectly set as structural elements (see this article for more information on this rather large topic):

But if the structure of the file is all set up correctly then the only thing you might have to address is the segmentation of text that is placed into Studio as a single segment when you’d prefer to see it broken down further, which brings me back to my sliced fruit.

Segmentation Rules

Segmentation rules are held on the translation memory or in language resource templates.  The options are exactly the same in both places.  Language resource templates are very similar to project templates in that they provide you with a way to create a new translation memory based on a configuration you use a lot.  So things like your Variable List, Abbreviation List, Ordinal Follower List and Segmentation Rules can all be set up in a language resource template and used as the basis of a new translation memory whenever you create one:

You create a new language resource template in the Translation Memories view by selecting New -> New Language Resource Template as shown above.  You can also create a new translation memory based on a previous one, but using these templates is handy because they take up little space and can easily be shared with others.  I think it would be handy to have an Apply Translation Memory Template application that worked in a similar way to Apply Project Template… and we might look at the feasibility of this in the near future.

To edit the segmentation rules in a translation memory you open it in the Translation Memories view and then select Settings from the ribbon, or right-click and select Settings from the context menu:

An important point to note is that you must select a translation memory to use the ribbon icon because otherwise it will be greyed out.  Once you’ve done this you’ll find the segmentation rules under the Language Resources node.  If you create a language resource template you’ll find the only settings in there are those listed under the Language Resources node shown below, so everything I’m about to explain is the same for both.

You then select Segmentation Rules and click on Edit.  There are two options in the next window:

  • Paragraph based segmentation
  • Sentence based segmentation

The first option, paragraph based, does not support customised segmentation rules.  It is just a way to segment your files based on paragraph as opposed to sentence, and a paragraph is determined by the filetype structure I explained earlier.  You can read a little more about paragraph segmentation in this article as it’s a useful option under the right circumstances.  For our sliced fruit we are going to be working with the Sentence based segmentation where you’ll find three default rules:

  • Full stop rule
  • Other terminating punctuation (question mark, exclamation mark)
  • Colon

Generally I’d leave these alone unless you have a very specific reason for wanting to change them and you understand why this will be necessary.  There are options when you edit them to add exceptions in addition to, or instead of, changing the rules and if you do play around in this area I’d recommend trying to use the exceptions first as this is usually a lot safer.  But we’re going to add new rules.

Adding Segmentation Rules

New segmentation rules work by defining three pieces of information:

  1. what characters are there before you break to start a new sentence
  2. what character do you actually want to break on
  3. what characters appear after the break

There are two views where you can apply these rules, a Basic View:

And an Advanced View:

You’ll notice that the Advanced View only has two places for information, whereas the Basic has three.  This is because the Advanced View uses regular expressions and the Before break pattern incorporates the first two pieces of information that you enter into the Basic View.
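
To make the relationship between the two views a little more concrete, here is a rough Python sketch of the idea.  It assumes that a break is placed at any position where the Before break pattern matches the text ending at that position and the After break pattern matches the text starting there.  That is a simplified model for illustration, not Studio’s actual code, and the function name split_on_rule and the sample sentence are made up.

```python
import re

def split_on_rule(text, before_break, after_break):
    """Split text at every position where before_break matches the text
    ending at that position and after_break matches the text starting there."""
    # \Z forces the before pattern to end exactly at the candidate position
    before_re = re.compile("(?:" + before_break + r")\Z")
    after_re = re.compile(after_break)
    segments, start = [], 0
    for pos in range(1, len(text)):
        if before_re.search(text[:pos]) and after_re.match(text, pos):
            segments.append(text[start:pos])
            start = pos
    segments.append(text[start:])
    return segments

# Basic View style input: three separate pieces of information
before_chars = r"\w"   # what comes before the break characters
break_chars = r"\.+"   # the characters to break on
after_chars = r"\s"    # what follows the break

# Advanced View style input: the first two pieces merged into one pattern
before_break = before_chars + break_chars

print(split_on_rule("One. Two. Three.", before_break, after_chars))
```

In this sketch the space after each full stop simply stays at the start of the next piece; no attempt is made to trim whitespace.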

The rules you create are handled sequentially.  So if we wanted to segment the first fruit file which is a comma plus space between the fruits we have to do two things:

  1. break before the comma
  2. break after the space

The reason we want to do this is so we are able to handle the words on their own and just filter out the comma and space in the editor.  How to do this is not always obvious, because the Basic View doesn’t give us options for everything we need unless the requirement is really basic.  Trying to add regular expressions in the Basic View and then switching to the Advanced View often leads to expressions that don’t work the way you expected, because they can be escaped twice; the rule then looks for a literal backslash instead of treating the backslash as the start of a pattern such as \s.  So to work around this I always enter whatever can be entered correctly in the Basic View and then use a capital X as a placeholder for anything that can’t.  This makes it easier to edit the Advanced View because the basic structure is already there.  This is best explained by an example:

Before break – I want to identify that there are letters before the last letter which is the break character

Break characters – this will be the last letter before the comma. I can’t enter this with an expression so I use a capital X

After break – I want a comma and a space, so use \s for the space and check Regular Expression

When I switch to the Advanced View I see this:

It’s hard to read, so this is what I have:

[\w\p{P}][X]+
,\s

It’s simple to see the capital X now and all I have to do is replace this with the regex for a word character.  For this I’ll use a \w and my expression looks like this:

[\w\p{P}][\w]+
,\s

When I open the file using the translation memory containing these rules, Studio will open the file like the image below, with the text segmenting before the comma.
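
If you want to check a pattern like this before trying it in Studio, the effect can be approximated in Python.  This is only a sketch under a few assumptions: the standard re module has no \p{P}, so an ASCII punctuation class built from string.punctuation stands in for it; the Before break pattern is rewritten as a two-character lookbehind, which finds the same break positions because the pattern only needs one character from the class followed by at least one word character; and the fruit list is made-up sample data.  Splitting on an empty match needs Python 3.7 or later.

```python
import re
import string

# Assumption: ASCII punctuation as a rough stand-in for \p{P}
PUNCT = re.escape(string.punctuation)

# Before break: [\w\p{P}][\w]+   After break: ,\s
rule_1 = re.compile(r"(?<=[\w" + PUNCT + r"]\w)(?=,\s)")

text = "apple, banana, cherry"
segments = rule_1.split(text)
print(segments)
```

Each fruit after the first still starts with the comma and space at this point, which is what the second rule deals with.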

So all I have to do now is create the second rule to break after the comma and the space.  This one is quite interesting because I don’t have anything before the break as the comma is now at the start of the segment, so I delete this entry and leave it blank.  I then enter a capital X for the break characters so I can add the comma and the space in the Advanced View and select Text for after the break:

This gives me the following:

[X]+
[\w\p{P}]

I add a regex for the comma and space to replace the X like this:

[,\s]+
[\w\p{P}]
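
Outside of Studio, the combined effect of the two rules can be approximated with the following Python sketch.  The caveats are important: string.punctuation is an ASCII stand-in for \p{P}, which the standard re module does not support; both Before break patterns are expressed as lookbehinds that find the same positions for this data; and the second rule is simply applied to each piece produced by the first, which is one reading of “handled sequentially” rather than a description of Studio’s internals.  The sample list is invented and sticks to single-word fruits, because in this simplified model the space inside a name like “passion fruit” would also satisfy the second rule.  It needs Python 3.7 or later.

```python
import re
import string

PUNCT = re.escape(string.punctuation)  # assumption: ASCII stand-in for \p{P}

# Rule 1: Before break [\w\p{P}][\w]+ and After break ,\s
rule_1 = re.compile(r"(?<=[\w" + PUNCT + r"]\w)(?=,\s)")
# Rule 2: Before break [,\s]+ and After break [\w\p{P}]
rule_2 = re.compile(r"(?<=[,\s])(?=[\w" + PUNCT + r"])")

def segment(text):
    """Apply rule 1, then rule 2 to each resulting piece."""
    pieces = []
    for part in rule_1.split(text):
        pieces.extend(rule_2.split(part))
    return pieces

segments = segment("apple, banana, cherry")
print(segments)

# Mimic filtering out the separator-only segments in the editor
words = [s for s in segments if re.search(r"\w", s)]
print(words)
```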

Then when I open the file in Studio this time I get exactly what I need and I can filter out the commas to see something like this:

Now working with the file for translation is a doddle and I’ll ensure only the words are entered into my translation memory, and it’s easier to add them to a termbase if I like and ensure term recognition as the commas won’t be in the way.

Finally, as this article got much longer than I originally intended (hopefully because I included things that are useful for you and not just because I thought it was a good idea!) I have created a video showing how to do this for all three fruit files in succession.

Duration: 15 min 16 seconds

When I write these articles I always start with thinking about the image at the top.  I do this for two reasons: the first is because it usually helps me think of some bizarre introduction (like this!) that helps me start writing, and the second is because every now and again I like to play around with Gimp, which is the free image software I occasionally use.  It’s always nice to spend a little time doing something frivolous because it’s good thinking time without being distracted by the job!  I don’t really know how to use this software at all, but it’s fun seeing what turns out… and I confess I often use a combination of PowerPoint and Gimp simply because some things are just easier in PowerPoint!  Eventually I might actually learn how to use it properly… I’ll keep practicing anyway.

Read More

There are people who believe that the original intention of the internet during its inception in the 1980s was to put the power of information in the hands of its users.  In fact the last three or four decades have seen the return of the wild wild west with the internet, e-mail, mobile technology, social media, online shopping, big data, cloud computing and now the internet of things.  All of this has been accessible to anyone, and anyone with the ability to create a website can give the impression they are far more trustworthy and capable than they actually are.  The way the growth of the internet has taken place has meant that only large organisations are able, in theory, to provide “security” and “trust” and we rely on them to validate our financial transactions, willingly handing over our personal data so that we no longer have any control over what happens with it.  Since the global social media phenomenon we even hand this data over to less secure environments, sharing our lives with the world and in the process becoming more and more oblivious to the implications of what we share.  Certainly a far cry from the original idea of a secure and private network for the users, and today individuals have next to zero control over their personal data.

Read More

Every time a new version of SDL Trados Studio is released there is usually a flurry of blogs and videos explaining what’s in it.  Some are really useful and full of details that will help a user decide whether the upgrade is for them or not, and others are written without any real understanding of what’s in the software or why the upgrade will help.  That’s really par for the course and always to be expected, since everyone is looking for the things they would like to meet their own needs.  So for me, when I’m looking for independent reviews of anything, I find the more helpful reviews give me as much information as possible and I can make my own mind up based on the utility I’ll get from it, the fun in using it and the cost of upgrade.  I put a couple of what I would consider helpful reviews here as they both try to cover as many of the new features available as possible.  So if you are in the early stages of wondering at a high level what’s in it for you then you could do a lot worse than spending 10 or 20 minutes of your time to read/watch the contributions from Emma and Nora below.

Read More

In the last year or so many articles have been written about XLIFF 2.0 explaining what’s so great about it, so I’m not going to write another one of those.  I’m in awe of the knowledge and effort the technical standard committees display in delivering the comprehensive documentation they do, working hard to deliver a solution to meet the needs of as many groups as possible.  The very existence of a standard however does not mean it’s the panacea for every problem it may be loosely related to.  It’s against this background I was prompted to write about this topic after reading this article questioning whether some companies were preventing translators from improving their lives.  The article makes a number of claims which I think might be a little misguided… in fact this is what it says:

XLIFF 2.0 is a “new” bilingual format for translation that attempts to do a handful important things for translators.

  • Improve the standard so that different translation tools makers, like SDL, don’t “need” to create their own proprietary versions that are not compatible with other tools
  • Creating true interoperability among tools, so translators can work in the tool of their choice, and end-customers can have flexibility about who they work with too
  • Allow businesses to embed more information in the files, like TM matches glossaries, or annotations, further enhancing interoperability

I say “new” because XLIFF 2.0 has been around for years now. Unfortunately, adoption of the XLIFF 2.0 standard has been slow, due to tools makers and other players deciding that interoperability is not in their interest. It’s one of those things where commerce gets in the way of sanity.

Read More

Studio 2019 has arrived and it brings with it some nice features on the surface, and some important improvements under the hood… but it also brings with it a lot more upgrades than just Studio, and I don’t just mean MultiTerm!  The SDL AppStore is one of the unique benefits you get when you work on the SDL technology stack and there are hundreds of apps available that can provide additional resources, custom filetypes, file converters, productivity enhancements, manuals, etc.  When you upgrade your version of Studio you are also going to have to upgrade your apps.  Many of the apps are maintained by the SDL Community team and these have all been upgraded ready for use in Studio 2019, but the majority have been created and maintained by others.  I’ve written this article to explain what you need to look out for as a user of SDL Trados Studio or MultiTerm, and also as a reference guide for the developers who might have missed the important information that was sent out to help them with the process. Read More

It could be said that translators come into the industry for the love of language, and the creative nature of the work, writing beautiful translations that at least do justice to the original texts.  It might even be true for many… but let’s face it, very few people can afford to do this for a full career without thinking about the money!  So it’s all the more surprising to me that translation vendors don’t provide a mechanism for dealing with the money in their toolsets.  Sure, you can have an analysis that can be used as the basis of a quote or an invoice, but you don’t see anywhere that deals with the money!  The larger Translation Management Systems have features for doing this, or they integrate with larger Enterprise systems for accounting and project management, but what about the translators?  How do they manage their business?

Well… there are applications on the SDL AppStore that can help with this in some ways.  For example:

  • SDL InQuote – an interesting, sometimes problematic application, that can allow you to create quotes and invoices based on the analysis files in your Studio projects
  • Post-Edit Compare – a wonderful application that in addition to carrying out a post-edit analysis of the work you are doing can put a value to it based on your rates.  But it doesn’t create quotes or invoices.
  • Qualitivity – another wonderful application that in addition to tracking just about everything you do in Studio can put a value to it based on the post-edit analysis or on a time basis.  But it doesn’t create quotes or invoices either.

Read More
