Archive

Tag Archives: translation memory

If there’s one thing I firmly believe, it’s that all translators should learn a little bit of regex, or regular expressions.  In fact it probably wouldn’t hurt anyone to know how to use them a little bit simply because they are so useful for manipulating text, especially when it comes to working in and out of spreadsheets.  When I started to think about this article today I was thinking about how to slice up text so that it’s better segmented for translation; and I was thinking about what data to use.  I settled on lists of data as this sort of question comes up quite often in the community and to create some sample files I used this Wikipedia page.  It’s a good list, so I copied it as plain text straight into Excel which got me a column of fruit formatted exactly as I would like to see it if I was translating it, one fruit per segment.  But as I wanted to replicate the sort of lists we see translators getting from their customers I copied the list into a text editor and used regex to replace the hard returns (\r\n) with a comma and a space, then broke the file up alphabetically… took me around a minute to do.  I’m pretty sure that kind of simple manipulation would be useful for many people in all walks of life.  But I digress….
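
For anyone who would rather script that replacement than do it in a text editor, here is a minimal sketch in Python.  The three fruit names and the join_lines helper are made up for illustration, and the optional \r is my own addition so that files with Unix-style line endings are handled too.

```python
import re

# A stand-in for the column copied out of Excel: one fruit per line,
# separated by hard returns (\r\n). The fruit names are illustrative only.
column = "Apple\r\nBanana\r\nCherry"

def join_lines(text, delimiter):
    """Replace every hard return with the given delimiter.

    The \r is optional here so plain \n line endings also work;
    that is an addition for this sketch, not something the article needs.
    """
    return re.sub(r"\r?\n", delimiter, text)

comma_space = join_lines(column, ", ")
print(comma_space)
```

Passing a different delimiter, such as "," or "\t", gives the other kinds of list discussed below.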

I now have three files created from my list and I’ll use these to try and explain the concepts of segmenting text in SDL Trados Studio through the use of segmentation rules in your translation memory.

  1. a comma plus space separated list
  2. a comma only separated list
  3. a tab delimited list

In Studio without creating custom segmentation rules I’m going to see something like these:

comma plus space

comma only

tab delimited

I think these are the most common sorts of problem we see users dealing with and hopefully they will allow me to explain the concepts you can use for segmenting on anything.  So, first of all let’s see where you create these rules.  Trados Studio segments files based on the structure of the file first and then using segmentation rules in your translation memory.  So if you have a Word file and you end a sentence with a full stop followed by a space, or you press the enter key which inserts a hard return (or a paragraph break), you will get a separate sentence (or segment) in Studio when you open it up.  Like this:

If you are working with markup files, like XML or HTML, then the segmentation is controlled by the parser rules for the separate elements.  For example, opening an XML file with the parser rules set incorrectly could lead to something like this where the file is segmented based on full stops at the end of each sentence, but also where the inline elements (<nice>, <info> and <not>) have been incorrectly set as structural elements (see this article for more information on this rather large topic):

But if the structure of the file is all set up correctly then the only thing you might have to address is the segmentation of text that is placed into Studio as a single segment when you’d prefer to see it broken down further, which brings me back to my sliced fruit.

Segmentation Rules

Segmentation rules are held on the translation memory or in language resource templates.  The options are exactly the same in both places.  Language resource templates are very similar to project templates in that they provide you with a way to create a new translation memory based on a configuration you use a lot.  So things like your Variable List, Abbreviation List, Ordinal Follower List and Segmentation Rules can all be set up in a language resource template and used as the basis of a new translation memory whenever you create one:

You create a new language resource template in the Translation Memories view by selecting New -> New Language Resource Template as shown above.  You can also create a new translation memory based on a previous one, but using these templates is handy because they take up little space and can easily be shared with others.  I think it would be handy to have an Apply Translation Memory Template application that worked in a similar way to Apply Project Template… and we might look at the feasibility of this in the near future.

To edit the segmentation rules in a translation memory you open it in the Translation Memories view and then select Settings from the ribbon, or right-click and select Settings from the context menu:

An important point to note is that you must select a translation memory to use the ribbon icon because otherwise it will be greyed out.  Once you’ve done this you’ll find the segmentation rules under the Language Resources node.  If you create a language resource template you’ll find the only settings in there are those listed under the Language Resources node shown below, so everything I’m about to explain is the same for both.

You then select Segmentation Rules and click on Edit.  There are two options in the next window:

  • Paragraph based segmentation
  • Sentence based segmentation

The first option, paragraph based, does not support customised segmentation rules.  It is just a way to segment your files based on paragraph as opposed to sentence, and a paragraph is determined by the filetype structure I explained earlier.  You can read a little more about paragraph segmentation in this article as it’s a useful option under the right circumstances.  For our sliced fruit we are going to be working with the Sentence based segmentation where you’ll find three default rules:

  • Full stop rule
  • Other terminating punctuation (question mark, exclamation mark)
  • Colon

Generally I’d leave these alone unless you have a very specific reason for wanting to change them and you understand why this will be necessary.  There are options when you edit them to add exceptions in addition to, or instead of, changing the rules and if you do play around in this area I’d recommend trying to use the exceptions first as this is usually a lot safer.  But we’re going to add new rules.

Adding Segmentation Rules

New segmentation rules work by defining three pieces of information:

  1. what characters are there before you break to start a new sentence
  2. what character do you actually want to break on
  3. what characters appear after the break

There are two views where you can apply these rules, a Basic View:

And an Advanced View:

You’ll notice that the Advanced View only has two places for information, whereas the Basic has three.  This is because the Advanced View uses regular expressions and the Before break pattern incorporates the first two pieces of information that you enter into the Basic View.
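
One way to picture a Before break / After break pair is as two patterns that must both match around the same position in the text.  The sketch below is my own illustration of that idea in Python and makes no claim about how Studio evaluates rules internally; find_breaks and split_at are hypothetical helper names, and the sample rule (break after a full stop when whitespace follows) is only a stand-in.

```python
import re

def find_breaks(text, before_break, after_break):
    """Return the offsets where both halves of a rule match.

    A break is reported at `pos` when `before_break` matches text ending
    exactly at `pos` and `after_break` matches text starting at `pos`.
    """
    before = re.compile(f"(?:{before_break})\\Z")  # must end at the break
    after = re.compile(after_break)
    return [
        pos
        for pos in range(1, len(text))
        # search(text, 0, pos) treats the string as if it ended at pos,
        # so \Z anchors the before-pattern to the candidate break.
        if after.match(text, pos) and before.search(text, 0, pos)
    ]

def split_at(text, breaks):
    """Cut the text into segments at the given offsets."""
    edges = [0, *breaks, len(text)]
    return [text[a:b] for a, b in zip(edges, edges[1:])]

sample = "One. Two. Three."
breaks = find_breaks(sample, r"\.", r"\s")
segments = split_at(sample, breaks)
print(segments)
```

The point of the sketch is only that the break itself has no width: everything matched by the first pattern stays in the segment that ends, and everything matched by the second pattern starts the next one.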

The rules you create are handled sequentially.  So to segment the first fruit file, which has a comma plus a space between the fruits, we have to do two things:

  1. break before the comma
  2. break after the space

The reason we want to do this is so we are able to handle the words on their own and just filter out the comma and space in the editor.  Doing this is not always obvious because the Basic View doesn’t give us all the options we need unless they are really basic.  Trying to add them in the Basic View and then switching to the Advanced View often leads to expressions that don’t work the way you expected, because they can be escaped twice; the expression then looks for a literal backslash instead of the pattern the backslash was meant to introduce.  So to work around this I always enter the information that is correct in the Basic View and then use a capital X for the information that is not there.  This allows me to edit the Advanced View more easily as the basic requirements are there.  This is best explained by an example:

Before break – I want to identify that there are letters before the last letter which is the break character

Break characters – this will be the last letter before the comma. I can’t enter this with an expression so I use a capital X

After break – I want a comma and a space, so use \s for the space and check Regular Expression

When I switch to the Advanced View I see this:

It’s hard to read, so this is what I have:

[\w\p{P}][X]+
,\s

It’s simple to see the capital X now and all I have to do is replace this with the regex for a word character.  For this I’ll use a \w and my expression looks like this:

[\w\p{P}][\w]+
,\s
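
Both the double-escaping problem and the capital X swap are easy to reproduce outside Studio.  The sketch below uses Python purely as an illustration: re.escape stands in for whatever escaping happens when you switch views (an assumption on my part), and the placeholder swap is plain string replacement.  Note that Python’s re module does not support \p{P}, so the expression containing it is only handled as text here and never compiled.

```python
import re

text = "Apple, Banana"

# What I meant: a comma followed by a whitespace character.
intended = r",\s"
# What I get if the expression is escaped a second time: the backslash
# is now a literal character, so "\s" no longer means "whitespace".
escaped_twice = re.escape(intended)

matches_intended = re.search(intended, text) is not None
matches_escaped = re.search(escaped_twice, text) is not None
print(matches_intended, matches_escaped)

# The capital X placeholder swap is ordinary text replacement.
# The expression is treated as a string only, because Python's re
# cannot compile \p{P}.
generated = r"[\w\p{P}][X]+"
edited = generated.replace("[X]", r"[\w]")
print(edited)
```

The escaped version only matches text that literally contains a backslash followed by an s, which is exactly the kind of expression that silently never fires.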

When I open the file using the translation memory containing these rules Studio will open the file like the image below, with the text segmenting before the comma.

So all I have to do now is create the second rule to break after the comma and the space.  This one is quite interesting because I don’t have anything before the break as the comma is now at the start of the segment, so I delete this entry and leave it blank.  I then enter a capital X for the break characters so I can add the comma and the space in the Advanced View and select Text for after the break:

This gives me the following:

[X]+
[\w\p{P}]

I add a regex for the comma and space to replace the X like this:

[,\s]+
[\w\p{P}]

Then when I open the file in Studio this time I get exactly what I need and I can filter out the commas to see something like this:
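
In code terms, the combined effect of the two rules can be approximated as follows.  This is a rough Python sketch under stated assumptions rather than a reproduction of Studio’s behaviour: the two rules are rewritten as lookaround patterns, \w alone stands in for [\w\p{P}] because Python’s re has no \p{P}, the fruit names are made up, and apply_rule is a hypothetical helper.  Splitting on zero-width matches needs Python 3.7 or later.

```python
import re

# Stand-ins for the two rules (assumptions made for this sketch):
# rule 1 breaks between word characters and a following comma + space,
# rule 2 breaks between a run of commas/spaces and the next word character.
RULE_1 = r"(?<=\w\w)(?=,\s)"
RULE_2 = r"(?<=[,\s])(?=\w)"

def apply_rule(segments, rule):
    """Split every existing segment wherever the rule finds a break."""
    result = []
    for segment in segments:
        # re.split accepts zero-width patterns from Python 3.7 onwards.
        result.extend(part for part in re.split(rule, segment) if part)
    return result

text = "Apple, Banana, Cherry"

after_rule_1 = apply_rule([text], RULE_1)
after_rule_2 = apply_rule(after_rule_1, RULE_2)

# Filtering out the separator-only segments leaves just the words,
# much like filtering the commas away in the editor.
words_only = [s for s in after_rule_2 if re.search(r"\w", s)]

print(after_rule_1)
print(after_rule_2)
print(words_only)
```

The intermediate list shows why the second rule is needed: after the first rule each comma and space is still stuck to the front of the following word.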

Now working with the file for translation is a doddle and I’ll ensure only the words are entered into my translation memory, and it’s easier to add them to a termbase if I like and ensure term recognition as the commas won’t be in the way.

Finally, as this article got much longer than I originally intended (hopefully because I included things that are useful for you and not just because I thought it was a good idea!) I have created a video showing how to do this for all three fruit files in succession.

Duration: 15 min 16 seconds

Is English (Europe) the new language on the other side of the Channel that we’ll all have to learn if Brexit actually happens… will Microsoft ever create a spellchecker for it now they added it to Windows 10?  Why are there 94 different variants of English in Studio coming from the Microsoft operating system and only two Microsoft Word English spellcheckers?  Why don’t we have English (Scouse), English (Geordie) or English (Brummie)… probably more distinct than the differences between English (United States) and English (United Kingdom) which are the two variants Microsoft can spellcheck.  These questions, and similar ones for other language variants are all questions I can’t answer and this article isn’t going to address!  But I am going to address a few of the problems that having so many variants can create for users of SDL Trados Studio.

Read More

Using segmentation rules on your Translation Memory is something most users struggle with from time to time; not just the creation of the rules, which is often just a question of a few regular expressions and well covered in posts like this from Nora Diaz and others, but rather how to ensure they apply when you want them, particularly when using the alignment module or retrofit in SDL Trados Studio where custom segmentation rules are being used.  Now I’m not going to take the credit for this article as I would not have even considered writing it if Evzen Polenka had not pointed out how Studio could be used to handle the segmentation of the target language text… something I wasn’t aware was even possible until yesterday.  So all credit to Evzen here for seeing the practical use of this feature and sharing his knowledge.  This is exactly what I love about the community, everyone can learn something and in practical terms many of SDL’s customers certainly know how to use the software better than some of us in SDL do!

Read More

The handling of numbers and units in Studio is always something that raises questions and over the years I’ve tackled it in various articles.  But one thing I don’t believe I have specifically addressed, and I do see this rear its head from time to time, is how to handle the spaces between a number and its unit.  So I thought it might be useful to tackle it in a simple article so I have a reference point when asked this question, and perhaps it’ll be useful for you at the same time.

I have a background in Civil Engineering so when I think about this topic I naturally fall back to “The International System of Units (SI)” which has a clear definition on this topic:

Read More

“More power to the elbow”… this is all about getting more from the resources you have already got, and in this case I’m talking about your Translation Memories.  In particular I’m talking about enabling them for upLIFT.  upLIFT, in case you have not heard about this yet despite all the marketing activity and forum discussions since August this year, is a technology that is being used in SDL Trados Studio 2017 to enable some pretty neat things.  I’m not going to devote this article to what upLIFT is all about as Emma Goldsmith has written a really useful article today that does a far better job than I could have done.  You can find Emma’s article here, called “SDL Trados studio 2017 : fragment recall and repair“.  But a quick summary to get us started is that upLIFT enables things like this:

  • fragment matching
    • whole Translation Units
    • partial Translation Units
  • fuzzy match repair
    • from fragment matching
    • from your termbase
    • from Machine Translation

Read More

Back in July 2013 I wrote an article called “Fields and Attributes in Studio” which was all about adding different types of metadata to your Translation Units every time you confirmed a segment to make it easier, or more complex depending on what you’ve done, to manage your Translation Memories.  If you’re not sure what I mean by this take a look at the article as I won’t repeat a lot of that here… at least I’ll try not to!  This capability in Studio is probably quite familiar to most users of the old SDL Trados 2007 and earlier, and was even essential to some extent because you could only use a single Translation Memory at a time.

Read More

Copyright Rudall30 | Dreamstime.com

I’ve written about how to handle bilingual Excel files, CSV files and tab delimited files in the past.  In fact one of the most popular articles I have ever written was this one “Creating a TM from a Termbase, or Glossary, in SDL Trados Studio” in July 2012, over three years ago.  Despite writing it I’m still struggling a little with why this would be useful other than if you have been given a glossary to translate or proofread perhaps… but nonetheless it doesn’t really matter what I think because clearly it was useful!

So, why am I bringing this up three years later?  Well, the recent launch of Studio 2015 introduced a new filetype that seems worthy of some discussion.  It’s a Bilingual Excel filetype that allows you to handle Excel files with bilingual content in a similar fashion to the approach described in the previous article.  There are some interesting differences though, and notably the first would be that you won’t lose any formatting in the Excel file, which is something that happened if you had to handle files like these as CSV or Tab Delimited Text.  That in itself might be interesting for some users because this was the first thing I’d hear when suggesting the CSV filetype as a solution for handling files of this nature.  Most of the time I don’t think this is really an issue but for those occasions where it is this is a good point.

Read More
