Psst… wanna know a few more things about file types?

I wrote under this title back in 2013 and provided a bit of information about the Word filetypes in Studio.  It was a pretty popular article and I always meant to circle back and do some more.  Seven is a lucky number, so now that we’re in 2020, seven years later, I thought I’d do it again… and this one is just as long, so grab a coffee first!

The message I wanted to leave users with after that article is that it’s always worth opening the options from time to time and just exploring the filetype options, particularly for the filetypes you work with often.  You never know what you’ll find in there.  If you did this there is one thing you’d find that we have not been making a song and dance about, and that’s embedded content options.  What do I mean by this?  If you take a quick review of this article where I covered the handling of taggy Excel files you’ll know what I mean.  Embedded content (in the context of this article) is usually text, programming code, or markup, that has been added to your file in such a way that it will be extracted as translatable content, but you want to protect it from translation.  So you want to “tag it” which is an “affectionate” name given to protecting the code from being changed during the translation process and at the same time make it easier for the translator to work around it.

The way embedded content is handled

Today, in Trados Studio 2021, you might not have realised that over 60% of filetypes do have the ability to handle embedded content.  But they don’t all do it the same way or carry the same advantages, or disadvantages, depending on what the content is.  In fact there seem to be two different ways this can be represented, and quite a few different places where the feature sits in the options depending on the filetype.

To try and simplify this I’m going to split this article into three:

  1. handling embedded content using tag definition rules
  2. handling embedded content using the Embedded Content Processor Filetypes
  3. handling embedded content in other ways

I don’t intend to show every configuration in the options but I have identified which filetypes support each type below.

Just with tag definition rules

This approach is the same as described in the article on handling taggy files from Excel.  You’ll find it under some of the filetype options like this:

The use of this approach is quite simple.  You activate the embedded content processing by checking this box:

Then you tell Studio which parts of the document your rules should apply to:

And finally you add your rules:

The rules can be used to define placeholders or tag pairs, and for tag pairs you can define whether the text between them should be translatable or not.  So a reasonable amount of flexibility in here, but you do need to have some knowledge of regular expressions if you want to avoid having to create a lot of rules.
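To make the regular expression point a little more concrete, here is a minimal sketch written in Python rather than inside Studio.  The three patterns, the sample sentence and the function name are assumptions made up for this illustration, and Studio itself uses the .NET regular expression flavour, so treat it as a way of testing ideas and not as the rules Studio ships with.  It shows how one generic pattern for opening tags and one for closing tags can stand in for a long list of element-specific rules, with a third pattern acting as a placeholder rule:

```python
import re

# Hypothetical patterns, for illustration only.
# One generic pattern per tag direction avoids writing a rule per element.
OPENING_TAG = re.compile(r"<[A-Za-z][A-Za-z0-9]*(?:\s[^<>]*)?>")
CLOSING_TAG = re.compile(r"</[A-Za-z][A-Za-z0-9]*>")
# A placeholder-style rule: printf-style variables such as %s or %d.
PLACEHOLDER = re.compile(r"%[sd]")


def protected_spans(text):
    """Return the pieces of text that the patterns above would protect."""
    combined = re.compile(
        "|".join(p.pattern for p in (CLOSING_TAG, OPENING_TAG, PLACEHOLDER))
    )
    return combined.findall(text)


sample = 'Click <strong>Save</strong> to store %s in <a href="x.html">the archive</a>.'
print(protected_spans(sample))
```

Running this prints `['<strong>', '</strong>', '%s', '<a href="x.html">', '</a>']`, so everything else in the sentence stays available for translation.  Very simple patterns like these should behave the same way in .NET, but anything more elaborate is worth testing in Studio itself.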

Affected file types

Which ones adopt this approach?  These ones:

  • Microsoft Word 2007-2019 (WordprocessingML v. 2)
  • Microsoft Word 97-2003 (DOC v 2.0.0.0)
  • Microsoft PowerPoint 2007-2019 (PresentationML v. 1)
  • Microsoft PowerPoint 97-2003 (PPT v 2.0.0.0)
  • Microsoft Excel 2007-2019 (SpreadsheetML v. 1)
  • Microsoft Excel 97-2003 (XLS v 3.0.0.0)
  • Bilingual Excel (Bilingual Excel v 1.0.0.0)
  • XLIFF (XLIFF 1.1-1.2 v 2.0.0.0)
  • XLIFF 2.0 (XLIFF 2.0 v 1.0.0.0)
  • Java Resources (Java Resources v 2.0.0.0)
  • Portable Object (PO files v 1.0.0.0)
  • Subtitle formats (Subtitles v 1.0.0.0)
  • XML: Author-it Compliant (XML: Author-it 1.2 v 1.0.0.0)
  • XML: Any XML (XML: Any v 1.2.0.0)
  • Text (Plain Text v 1.0.0.0)
  • Custom filetypes
    • Regular Expression Delimited Text (RegEx v 1.0.0.0)
    • XML (Legacy Embedded Content) (XML v 1.2.0.0)

Using an Embedded Content Processor Filetype

More recent versions of Studio have gradually introduced the concept of chaining filetypes together.  So you could open a file with one filetype, but handle the embedded content with another one.  A good example of where this is really appropriate is when the embedded content is HTML with a lot of HTML syntax to create rules for.  Using the HTML filetype as an embedded content processor means the markup is all handled perfectly and you don’t need to write any rules with regular expressions yourself.
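The idea of chaining can be pictured outside Studio with a small Python sketch.  This is only an analogy and not how Studio is implemented: the JSON content, the key name, the class and the function below are all invented for the example.  The outer format (JSON) is read first, and the value it contains is then handed to a real HTML parser instead of a set of hand-written regular expressions:

```python
import json
from html.parser import HTMLParser


class TextAndTags(HTMLParser):
    """Split an HTML fragment into translatable text and protected tags."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Keep the start tag exactly as it was written, attributes included.
        self.tags.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")

    def handle_data(self, data):
        self.text.append(data)


def split_embedded_html(json_source, key):
    value = json.loads(json_source)[key]  # first step: the outer format (JSON)
    parser = TextAndTags()                # second step: the embedded format (HTML)
    parser.feed(value)
    parser.close()
    return "".join(parser.text), parser.tags


# Hypothetical sample content.
source = json.dumps(
    {"welcome": 'Hello <b>world</b>, read the <a href="faq.html">FAQ</a>.'}
)
text, tags = split_embedded_html(source, "welcome")
print(text)
print(tags)
```

No pattern had to be written for the anchor element or its attribute; the parser already knows the syntax, which is the same benefit you get from picking HTML as the embedded content processor.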

This technique manifests itself in the filetypes in a few ways… so in the email filetype it’s like this under the “Common” node as opposed to “Embedded Content”:

In the XHTML 1.1 (2) filetype it’s under the “Embedded Content” node:

But in all cases the basic idea is the same.  You select the filetype you wish to be used to process the embedded content from a drop down list similar to this:

By default you’ll see there are three options:

  1. Plain text
  2. Excel spreadsheet
  3. HTML

But you can have many more… as long as they are based on these defaults.  To explain what I mean by this let’s close up the nodes and look again at the options:

Underneath the “File Types” you’ll see “Embedded Content Processors” where you can select and copy them to create as many as you need, like this for example:

Each one of these can have different rules so that you can create your custom filetypes with unique rules to solve the particular problem you need for a particular situation.  NOTE: after creating new embedded content processors they won’t be visible in the filetype for selection until you close your settings and reopen them again.  But once done you see something like this:

Now I can select the two new ones I just created when I’m configuring the XHTML filetype I referred to earlier.

Whilst the choice of filetypes to chain is quite limited, I think these three probably cover the majority of cases you’re likely to come across.  Configuring them is simple:

  1. Plain text
    1. this is exactly the same as creating a regular expression based filetype.
    2. define the structure
    3. define the inline tags
  2. Excel spreadsheet
    1. exactly the same as described in the article on taggy Excel files
  3. HTML
    1. just refine the parser rules already provided to suit your use case
    2. for most files you won’t need to do anything at all

I deliberately haven’t gone into a lot of detail here for two reasons.  The first is because I think by now you’ll have got the idea and probably don’t need to know anything else, and secondly because I have covered these principles in some detail before.  If you review this article on custom XML (scroll down to the sub-headings on embedded content processing) you should find what you need to know.  But if there are specific questions feel free to ask in the comments below or post into the SDL Community.

Affected file types

Which ones adopt this approach?  These ones:

  • Email (EMAIL v 1.0.0.0)
  • XHTML 1.1 (2) (XHTML 1.1 v 2.0.0.0)
  • XHTML 1.1 (XHTML 1.1 v 1.2.0.0)
  • HTML 5 (Html 5 2.0.0.0)
  • JSON (JSON v 1.0.0.0)
  • YAML (YAML v 1.0.0.0)
  • Markdown (Markdown v 1.0.0.0)
  • XML 2: Microsoft .NET Resources (XML: RESX v 2.0.0.0)
  • XML: Microsoft .NET Resources (XML: RESX v 1.2.0.0)
  • XML 2: OASIS DITA 1.3 Compliant (XML: DITA 1.3 v 2.0.0.0)
  • XML: OASIS DITA 1.3 Compliant (XML: DITA 1.2 v 1.2.0.0)
  • XML 2: OASIS DocBook 4.5 Compliant (XML: DocBook 4.5 v 2.0.0.0)
  • XML: OASIS DocBook 4.5 Compliant (DocBook 4.5 v 1.2.0.0)
  • XML 2: Author-IT Compliant (XML: Author-IT 1.2 v 2.0.0.0)
  • XML 2: MadCap Compliant (XML: MadCap 1.2 v 2.0.0.0)
  • XML: MadCap Compliant (XML: MadCap 1.2 v 1.0.0.0)
  • XML 2: W3C ITS Compliant (XML: ITS 1.0 v 2.0.0.0)
  • Custom filetypes
    • HTML 5 (Html File v 2.0.0.0)
    • HTML 4 (Html File v 2.0.0.0)
    • XML 2 (XML v 2.0.0.0)
    • XML (Embedded Content) (XML v 1.3.0.0)

A few gotchas…

Scope and specifications

Sometimes there is an embedded content option, but it can be restrictive in terms of the coverage offered.  So you may need to do a little more investigative work to figure out why, if you can’t immediately see the reason, your non-translatable text is not being protected.  A good example would be Markdown files (*.md).

Here the only embedded content that can be processed is within code blocks and HTML blocks.  So if you are trying to handle embedded content in a Markdown file and it’s not working you first need to check whether the content has been written inside one of these objects.

    • If not then you have your answer… you need to handle the content some other way.
    • If it has then you need to make sure that the Studio Markdown filetype understands these objects in the same way you do.

What do I mean by understanding the objects?  The specification for Markdown is a little loose, and whilst I don’t believe we have documented this anywhere I think the one we follow would be this:

https://spec.commonmark.org/

The rules for code blocks (indented code blocks and fenced code blocks) and HTML blocks are quite well described in here.  However, there is other documentation, equally valid:

https://www.markdownguide.org/basic-syntax/#code-blocks

But the problem here is that the rules in this other documentation are not as comprehensive and it’s very easy to create Markdown code that may work for the application intended, but Studio won’t see it that way.  I saw a good example of this a week ago where the code block, an indented code block, worked fine where it was used.  But because there wasn’t a blank line after the paragraph preceding it Studio didn’t see it that way and so the embedded content processor could not pick up the non-translatable elements.
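The CommonMark specification says that an indented code block cannot interrupt a paragraph, so an indented line that directly follows paragraph text is read as a continuation of that paragraph.  A rough way to spot this before blaming the embedded content processor is sketched below in Python.  The function name and the sample text are assumptions for illustration, and the check is deliberately simplified: it treats every non-blank, non-indented line as paragraph text, so headings, lists, fenced blocks and tab indentation are not handled properly.

```python
def lazy_code_lines(markdown):
    """Return 1-based numbers of lines indented by four or more spaces that
    directly follow a non-blank, non-indented line.

    Simplified heuristic: such lines are likely to be treated as paragraph
    continuation and not as the start of an indented code block.
    """
    flagged = []
    previous = ""
    for number, line in enumerate(markdown.splitlines(), start=1):
        indented = line.startswith("    ") and line.strip() != ""
        previous_is_text = previous.strip() != "" and not previous.startswith("    ")
        if indented and previous_is_text:
            flagged.append(number)
        previous = line
    return flagged


# Hypothetical samples: the only difference is the blank line.
broken = 'Run this command:\n    print("hello")\n'
fixed = 'Run this command:\n\n    print("hello")\n'

print(lazy_code_lines(broken))  # [2]
print(lazy_code_lines(fixed))   # []
```

If the function flags a line, adding a blank line before it in the source file (where you are able to change the source) is the first thing to try.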

This same “gotcha” can apply in other areas.  The message I’m trying to get across is that you need to investigate more than just Studio before concluding that the embedded content processing simply doesn’t work.

Document Structure

Another common problem I have come across when users try to use these features is the use of Document Structure.  Studio uses the Document Structure to improve the accuracy of leverage from your translation memory by adding context information to each translation unit you save.  Take this simple example where I added some markup (<strong>multifarious</strong>) into a Word document:

If I open this in Studio I see this in the Document Structure column:

The right-hand column tells us what the Document Structure is… in this case a paragraph (P), list items (LI) and table cells (TC+).  The plus symbol after TC tells us that there is actually more structure associated with this one, which I can see by clicking on it:

The reason this is all important is because if I create a rule for this markup I have to specify the relevant Document Structure.  If I just create a rule for the paragraph then I get this:

Only the segment with the Paragraph Document Structure is tagged.  To get them all you need to add each applicable reference for the Document Structure:

When you do that the file will look like this:

This creates two emotions in me… first of all one of relief because now I know why my embedded content processor didn’t work!  And then a second emotion which is less positive because I don’t understand why there isn’t a catch-all structure to ensure my rules apply globally and not just on specific parts of my file.  I like that this granularity is possible because it does lend itself to more complex scenarios where you might only want to tag content in a specific set of circumstances (unlikely in my opinion… but let’s be generous and enjoy the sophistication), however, let’s do something simple for the majority of use cases!  I have tried testing against body and section, which appear to be structural items in the underlying XML, but these have no effect.
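The effect of Document Structure on a rule can be pictured with a toy model in Python.  This is not how Studio works internally; the segment texts, the structure codes held as plain strings and the function name are assumptions for illustration only.  A rule fires only when the structure of the segment is in the list attached to that rule:

```python
import re

# Hypothetical rule: protect opening and closing <strong> tags.
STRONG_TAGS = re.compile(r"</?strong>")

# (document structure code, segment text), loosely mirroring the example above.
segments = [
    ("P", "A <strong>multifarious</strong> paragraph"),
    ("LI", "A <strong>multifarious</strong> list item"),
    ("TC", "A <strong>multifarious</strong> table cell"),
]


def tagged_segments(segments, rule_structures):
    """Return the structure codes of the segments where the rule would fire."""
    return [
        code
        for code, text in segments
        if code in rule_structures and STRONG_TAGS.search(text)
    ]


print(tagged_segments(segments, {"P"}))              # ['P']
print(tagged_segments(segments, {"P", "LI", "TC"}))  # ['P', 'LI', 'TC']
```

With only the paragraph structure attached, the list item and the table cell are left untouched even though they contain exactly the same markup, which is the behaviour seen in the screenshots.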

So vote for this idea!  Clearly we thought it would be a good idea and it was added into the filetype options for Trados Live… so some consistency and parity across the tools would be good!

Filetype options versus Project settings

Ah yes… that old chestnut!  Make sure, especially if you’re new to Trados Studio, that you know where you are checking the settings you created.  If you have already created your project then you cannot alter the way the text has been extracted and will need to create the project again.  It’s worth reading this great article from Jerzy Czopik… Tea and Settings!

What about the ones that don’t?

If you have a need to handle embedded content in files that don’t support anything I’ve covered before then you have two options at least… translators and localization engineers have an unsurpassed ability to invent the most amazing solutions when needed so I’ll just cover the basics here:

  1. address the content in the source file
  2. use a plugin from the SDL AppStore

Address the content in the source file

One way you may be able to tackle this is through the use of non-translatable styles.  By applying a specific style for content that should not be translated in the source file you might be able to use them in the file type settings to exclude the content from translation.  For example:

This method isn’t always going to be helpful because you don’t have a lot of control over how the content should be handled.  It’s all converted to structure tags.  But if you only need to completely exclude blocks of content from translation then this can be a very effective and simple way to do it.

Use a plugin from the SDL AppStore

The best way to manage it (in my opinion), if you do need a little more control, is to address this after the project has already been created.  Coupled with the improved filter capabilities in Studio this approach can be very effective.  Worth noting that it’s probably not unusual to receive project packages where the person who created them had a very limited knowledge of how to work with filetypes and non-translatable content.  So being able to address this after the project has been prepared is very useful indeed!

There are two applications freely available on the SDL AppStore to help with this:

  1. CleanUp Tasks
  2. SDL Data Protection Suite

My preference is for the Data Protection Suite simply because I think it’s a more robust and easier solution to use.  But CleanUp Tasks does offer quite a few interesting possibilities including being able to work with tag pairs which isn’t possible using the Data Protection Suite.  I don’t intend to cover these applications in this article… it’s already longer than I originally intended (my apologies, and thanks, if you’ve made it this far!)… so if you have any specific questions feel free to ask in the comments below or post into the SDL Community.  I’d also be interested if there is anything related to the use of filetypes, or embedded content that you think could do with a separate article to clarify the details.

Affected file types

Which ones don’t handle embedded content using the methods above at all?  These ones:

  • SDL XLIFF (SDL XLIFF 1.0 v 1.0.0.0)
  • TRADOStag (TTX 2.0 v 2.0.0.0)
  • SDL Edit (ITD v 1.0.0.0)
  • SDL Trados Translator’s Workbench (Bilingual Workbench 1.0.0.0)
  • Rich Text Format (RTF) (RTF v 2.0.0.0)
  • Microsoft Visio (Visio v 1.0.0.0)
  • Adobe FrameMaker 8-2020 MIF V2 (FrameMaker v 10.0.0)
  • Adobe FrameMaker 8-2020 MIF (FrameMaker 8.0 v 2.0.0.0)
  • Adobe InDesign CS2-CS4 INX (Inx 1.0.0.0)
  • Adobe InDesign CS4-CC IDML (IDML v 1.0.0.0)
  • Adobe InCopy CS4-CC ICML (ICML Filter 1.0.0.0)
  • Adobe Photoshop (Photoshop v 1.0.0.0)
  • OpenDocument Text Document (ODT) (Odt 1.0.0.0)
  • OpenDocument Presentation (ODP) (Odp 1.0.0.0)
  • OpenDocument Spreadsheet (ODS) (Ods 1.0.0.0)
  • QuarkXPress Export (QuarkXPress v 2.0.0.0)
  • XLIFF: Kilgray MemoQ (MemoQ v 1.0.0.0)
  • PDF (PDF v 3.0.0.0)
  • Comma Delimited Text (CSV) (CSV v 2.0.0.0)
  • Tab Delimited Text (Tab Delimited v 2.0.0.0)
  • XML: W3C ITS Compliant (XML: ITS 1.0 v 1.2.0.0)
  • XML 2: Any XML (XML: Any v 2.0.0.0)
  • Custom filetypes
    • Simple Delimited Text (Delimited Text v 2.0.0.0)

Conclusion

As with so many of the articles I write, I find that the more I start looking into a topic, the more there is to talk about and it’s really hard knowing where to stop.  Certainly the labyrinth of Studio settings and file types can leave many users viewing it as a bit of a Pandora’s box.  This is quite unfortunate because the best way to learn about the capabilities of Trados Studio is to explore these things.  Just take a little bit at a time, and if you don’t understand something ask about it in the SDL Community.  Discussions around these sorts of things are always really welcome… it’s not just a place to go when you have a problem!  And if you do this I can guarantee you’ll find your ability to work with any tool will be significantly improved.

Some you win… some you lose

When we released the new Trados 2021 last week I fully intended to make my first article, after the summary of the release notes, something based around the new appstore integration.  The number of issues we are seeing with this release is very low, which is a good thing, but nonetheless I feel compelled to tackle one thing first that has come up a little in the forums.  It relates to some changes made to improve the product for the many.

Continue reading

The versatile regex based text filter in Trados Studio…

After attending the xl8cluj conference in Romania a few weeks ago, which was an excellent, and very technical conference for translators, I thought it was about time I wrote an article around the things you can do with the Regular Expression Delimited Text filter since it is so useful for solving all kinds of tasks related to text based files that don’t fit any of the out of the box formats available in the product.  Files such as software string files and csv files are common examples of where understanding how to work with this customisable file type can yield many benefits.  So this article is food for thought and a few things that might be helpful to you in the future.  It’s also pretty long (I’m not kidding!), so maybe grab a cup of coffee before you start to go through it!

Continue reading

Wot! No target!!

The origin of Chad (if you’re British) or Kilroy (if you’re American) seems largely supposition.  The most likely story I could find, or rather the one I like the most, is that it was created by the late cartoonist George Edward Chatterton ‘Chat’ in 1937 to advertise dance events at a local RAF (Royal Air Force) base.  After that Chad is remembered for bringing attention to any shortages, or shortcomings, in wartime Britain with messages like Wot! No eggs!!, and Wot! No fags!!.  It’s not used a lot these days, but for those of us aware of the symbolism it’s probably a fitting exclamation when you can’t save your target file after completing a translation in Trados Studio!  At least that would be the polite exclamation since this is one of the most frustrating scenarios you may come across!

At the start of this article I fully intended this to be a simple description of the problems around saving the target file, but like so many things I write it hasn’t turned out that way!  But I found it a useful exercise so I hope you will too.  So, let’s start simple despite that introduction because the reasons for this problem usually boil down to one or more of these three things:

  1. Not preparing the project so it’s suitable for sharing
  2. Corruption of a project file
  3. A problem with the source file or the Studio filetype

Continue reading

Priorities… paths… filetypes….

At the beginning of each year we probably all review our priorities for the New Year ahead so we have a well balanced start… use that gym membership properly, study for a new language, get accredited in some new skill, stop eating chocolate… although that may be going just a bit too far, everything is fine with a little moderation!  I have to admit that moderating chocolate isn’t, and may never be, one of my strong points even though it’s on my list again this year!  But the idea of looking at our priorities and setting them up appropriately is a good one so I thought I’d start off 2018 with a short article explaining why this is even important when using SDL Trados Studio, particularly because I see new users struggling with, or just not being aware of, the concepts around the prioritisation of filetypes.  If you don’t understand them then you can find code doesn’t get tagged correctly despite you setting it up, or non-translatable text is always getting extracted for translation even though you’re sure you excluded it, or even files being completely mishandled. Continue reading

Double vision!!

There are well over 200 applications in the SDL AppStore and the vast majority are free.  I think many users only look at the free apps, and I couldn’t blame them for that as I sometimes do the same thing when it comes to mobile apps.  But every now and again I find something that I would have to pay for but it just looks too useful to ignore.  The same logic applies to the SDL AppStore and there are some developers creating some marvellous solutions that are not free.  So this is the first of a number of articles I’m planning to write about the paid applications, some of them costing only a few euros and others a little more. Are they worth the money?  I think the developers deserve to be paid for the effort they’ve gone to but I’ll let you be the judge of that and I’ll begin by explaining why this article is called double vision!!

From time to time I see translators asking how they can get target documents (the translated version) that are fully formatted but contain the source and the target text… so doubling up on the text that’s required.  I’ve seen all kinds of workarounds ranging from copy and paste to using an auto hotkey script that grabs the text from the source segment and pastes it into the target every time you confirm a translation. It’s a bit of an odd requirement but since we do see it, it’s good to know there is a way to handle it. But perhaps a better way to handle it now would be to use the “RyS Enhanced Target Document Generator” app from the SDL AppStore? Continue reading

Iris Optical Character Recognition

I’m back on the topic of PDF support!  I have written about this a few times in the past with “I thought Studio could handle a PDF?” and “Handling PDFs… is there a best way?”, and this could give people the impression I’m a fan of translating PDF files.  But I’m not!  If I was asked to handle PDF files for translation I’d do everything I could to get hold of the original source file that was used to create the PDF because this is always going to be a better solution.  But the reality of life for many translators is that getting the original source file is not always an option.  I was fortunate enough to be able to attend the FIT Conference in Brisbane a few weeks ago and I was surprised at how many freelance translators and agencies I met dealt with large volumes of PDF files from all over the world, often coming from hospitals where the content was a mixture of typed and handwritten material, and almost always on a 24-hr turnaround.  The process of dealing with these files is really tricky and normally involves using Optical Character Recognition (OCR) software such as ABBYY FineReader to get the content into Microsoft Word and then a tidy up exercise in Word.  All of this takes so long it’s sometimes easier to just recreate the files in Word and translate them as you go!  Translate in Word… sacrilege to my ears!  But this is reality and looking at some of the examples of files I was given there are times when I think I’d even recommend working that way!

Continue reading

Cutie Cat?

A nice picture of a cutie cat… although I’m really looking for a cutie linguist and didn’t think it would be appropriate to share my vision for that!  More seriously the truth isn’t as risqué… I’m really after Qt Linguist.  Now maybe you come across this more often than I do so the solutions for dealing with files from the Qt product, often shared as *.TS files, may simply roll off your tongue.  I think the first time I saw them I just looked at the format with a text editor, saw they looked pretty simple and created a custom filetype to deal with them in Studio 2009.  Since that date I’ve only been asked a handful of times so I don’t think about this a lot… in fact the cutie cat would get more attention!  But in the last few weeks I’ve been asked four times by different people and I’ve seen a question on proZ so I thought it may be worth looking a little deeper.

Continue reading

All that glitters is not gold…

Years ago, when I was still in the Army, there was a saying that we used to live by for routine inspections.  “If it looks right, it is right”… or perhaps more fittingly “bullshit baffles brains”.  These were really all about making sure that you knew what had to be addressed in order to satisfy an often trivial inspection, and to a large extent this approach worked as long as nobody dug a little deeper to get at the truth.  This approach is not limited to the Army however, and today it’s easy to create a polished website, make statements with plenty of smiling users, offer something for free and then share it all over social media.  But what is different today is that there is potential to reach tens of thousands of people and not all of them will dig a little deeper… so the potential for reward is high, and the potential for disappointment is similarly high.

Continue reading

Getting a filetype preview…

One of my favourite features in Studio 2017 is the filetype preview.  The time it can save when you are creating custom filetypes comes from the fun in using it.  I can fill out all the rules and switch between the preview and the rules editor without having to continually close the options, open the file, see if it worked and then close the file and go back to the options again… then repeat from the start… again… and again…   I guess it’s the little things that keep us happy!

I decided to look at this using a YAML file as this seems to be coming up quite a bit recently.  YAML, which rhymes with “camel”, stands for “YAML Ain’t Markup Language” and I believe it’s a superset of the JSON format, but with the goal of making it more human readable.  The specification for YAML is here, YAML Specification, and to do a really thorough job I guess I could try and follow the rules set out.  But in practice I’ve found that creating a simple Regular Expression Delimited Text filetype based on the sample files I’ve seen has been the key to handling this format.  Looking ahead I think it would be useful to see a filetype created either as a plugin through the SDL AppStore, or within the core product just to make it easier for users not comfortable with creating their own filetypes.  But I digress…

Continue reading