Studio Tips

I’m back on the topic of PDF support!  I have written about this a few times in the past with “I thought Studio could handle a PDF?” and “Handling PDFs… is there a best way?”, and this could give people the impression I’m a fan of translating PDF files.  But I’m not!  If I were asked to handle PDF files for translation I’d do everything I could to get hold of the original source file that was used to create the PDF because this is always going to be a better solution.  But the reality of life for many translators is that getting the original source file is not always an option.  I was fortunate enough to be able to attend the FIT Conference in Brisbane a few weeks ago and I was surprised at how many of the freelance translators and agencies I met dealt with large volumes of PDF files from all over the world, often coming from hospitals where the content was a mixture of typed and handwritten material, and almost always on a 24-hour turnaround.  The process of dealing with these files is really tricky and normally involves using Optical Character Recognition (OCR) software such as ABBYY FineReader to get the content into Microsoft Word, followed by a tidy-up exercise in Word.  All of this takes so long it’s sometimes easier to just recreate the files in Word and translate them as you go!  Translate in Word… sacrilege to my ears!  But this is reality, and looking at some of the examples of files I was given there are times when I think I’d even recommend working that way!

But there were files I saw that looked as though they should be possible to handle in a proper translation environment.  We tested a few and the results were more often than not pretty poor.  So even though we could open them up it was still better to take the DOCX that Studio creates when you open a PDF and then tidy up the Word file for translation.  At least this is some progress… now we’re able to handle the content in a translation environment and not have to recreate the entire file.  But it would be even better if the OCR software could make a better job of it.  And this is where I want to get to… better OCR!

SDL Trados Studio 2017 continued to provide the same PDF filetype as earlier versions of Studio, using technology from Solid Documents, and this does a fairly good job of extracting the translatable text with OCR for many files.  But it could use improvement.  SDL Trados Studio 2017 SR1 has introduced another option for OCR using software called ReadIris, whose maker is part of the Canon Group.

Out of the box, according to the documentation, Iris supports 134 languages for OCR which is pretty impressive.  They don’t quite match the languages supported by Studio however, but a rough count and compare suggests there are some 95 shared languages… and they even support Haitian Creole which Studio does not as we know 😉  Still impressive however and it easily beats the 14 languages supported by Solid Documents in Studio 2017 prior to the introduction of Iris.  Additionally this opens the possibilities for handling scanned PDF files in Asian languages, Arabic, Hebrew and many others that were previously difficult, if not impossible, to handle.

Using the new options

So let’s take a look at where you can find this new option and how you use it.  First of all you need to go to your options:

File -> Options -> File Types -> PDF

Then navigate down to “Converter”.  Down near the bottom you’ll see the “Recognize PDF text” group as shown below and the option to activate this new feature is at the end:

Check the box and you’ll be presented with this screen:

It’s an App!  You may be wondering why you need to do this and why it was not just integrated into Studio?  The reason is simple… not everyone will want this option and the underlying software requires a 150 MB download which would have increased the size of the Studio installer to over half a gigabyte.  So it was made optional.  If you want it you click on the “Visit AppStore” link in the message above, or the one I just wrote, and download and install the plugin just as you would any plugin from the AppStore.  If you don’t do this then Studio won’t be using the software.  There are no warnings, and the option remains checked, but you won’t be using it.  So when I open the Chinese PDF I just created by copying some text as an image and saving it to a PDF all I’ll get is this:

None of the text is extracted for translation at all.  But if I install the plugin and try again I see this:

Now we’re cooking!  It would be useful to get rid of the tags though, as these seem to be aesthetic only, just colours and font changes where the OCR picked up a few minor differences and then introduced tags to control them.  As these are formatting tags only I could just ignore them, or press Ctrl+Shift+H to hide them in the editor.  But if I want to remove them altogether I can do this with another app called Cleanup Tasks that I have written about before.  These three options do the job for this file:

Now I have this and can translate without any tags at all:

Nice… and if all of that sounds complicated it wasn’t really.  I created a short…ish video below putting this all together so you have an idea of how it works.

Approx. length : 16.26 mins

After all of that I don’t want you to get the impression I’m a converted believer in the possibilities of PDF translation… I’m not.  We’re unlikely to see the back of PDFs for translation any time soon, so I am happy to see the technology to support this workflow improving all the time.  I also don’t want to give the impression this is going to help with every PDF you ever see.  It won’t!  The problems of PDF quality don’t go away because they come from the way the files have been created in the first place, so source is always best.  You’re also quite likely to find PDFs you can’t handle even with Iris, and you might even find that the more basic option without Iris does a better job of your PDF conversion.  So it’s horses for courses… you have the tools and can apply the most appropriate one for your job.

If you have any questions after reading this post or watching the video then I’d recommend you visit the SDL Community and ask in there… or just post into the comments below.

A nice picture of a cutie cat… although I’m really looking for a cutie linguist and didn’t think it would be appropriate to share my vision for that!  More seriously the truth isn’t as risqué… I’m really after Qt Linguist.  Now maybe you come across this more often than I do so the solutions for dealing with files from the Qt product, often shared as *.TS files, may simply roll off your tongue.  I think the first time I saw them I just looked at the format with a text editor, saw they looked pretty simple and created a custom filetype to deal with them in Studio 2009.  Since that date I’ve only been asked a handful of times so I don’t think about this a lot… in fact the cutie cat would get more attention!  But in the last few weeks I’ve been asked four times by different people and I’ve seen a question on ProZ so I thought it may be worth looking a little deeper.
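To give an idea of why the format looks simple in a text editor, here is a minimal Python sketch that reads the source and translation text out of a small TS file.  The sample content, the function name read_messages and the "finished" and "missing" labels are my own illustrative assumptions, and real files from Qt Linguist can carry more elements (plurals, comments, locations) than this handles.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample file content, trimmed to the bare minimum for illustration.
SAMPLE_TS = """<!DOCTYPE TS>
<TS version="2.1" language="de_DE">
<context>
    <name>MainWindow</name>
    <message>
        <source>Open file</source>
        <translation type="unfinished"></translation>
    </message>
    <message>
        <source>Quit</source>
        <translation>Beenden</translation>
    </message>
</context>
</TS>
"""


def read_messages(ts_text):
    """Return (context, source, translation, status) tuples from TS content."""
    root = ET.fromstring(ts_text)
    rows = []
    for context in root.iter("context"):
        name = context.findtext("name", default="")
        for message in context.iter("message"):
            source = message.findtext("source", default="")
            translation = message.find("translation")
            if translation is None:
                rows.append((name, source, "", "missing"))
                continue
            # Assumption: no "type" attribute is treated as a finished translation.
            status = translation.get("type", "finished")
            rows.append((name, source, translation.text or "", status))
    return rows


for row in read_messages(SAMPLE_TS):
    print(row)
```

This is only a way of looking inside the file and not how a Studio filetype works; the custom filetype still has to decide what is translatable and write the target back in the right place.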

Read More

Not Marvel Comics, but rather the number four which does have some pretty interesting properties.  It’s the only cardinal number in the English language to have the same number of letters as its value; in Buddhism there are four noble truths; in Harry Potter there are four Houses of Hogwarts; humans have four canines and four wisdom teeth; in chemistry there are four basic states of matter… but more importantly, for translators using Studio 2017 there are four ways, out of the box, to get started!

Now with that very tenuous link let’s get to the point.  Four ways to start translating, all of them pretty easy but they all have their pros and cons.  So getting to grips with this from the start is going to help you decide which is best for you.  First of all what are they?

  1. Translate single document
  2. Create a project
  3. Drag and drop your files
  4. Right-click and “Translate in SDL Trados Studio”

And now we know what they are should you use one process for all, or can you mix and match?  I mix and match all the time, mainly between 1. and 2. but let’s look at the differences first and you can make your own mind up.

Read More

It’s been a while since I wrote anything about the SDLXLIFF Toolkit… in fact I haven’t done since it was first released with the 2014 version of Studio.  Now that we have added a few new things such as SDLPLUGINS so that apps are better integrated and can be more easily distributed with Studio we have launched a new version of the toolkit for Studio 2017.  What’s new?  To be honest not a lot, but there are a couple of things that I think warrant this visit.

First of all, the app is now a plugin and this means it loads faster, is always available and there are a few tricks to being able to get the most from this.  Secondly, there are a few fixes to the search & replace features that make it possible to complete tasks that Studio will fail with and to do this the API team completely rebuilt the regex engine.  So whilst you won’t see too many changes, there are a few under the hood.

The best way to illustrate this is to show you so I have created a short video below where I have tried to explain how best to use the toolkit now it’s a plugin and not a standalone application, and I used the problems described below to demonstrate how it works.  If you want to know what else it can do I have reproduced part of the original guide below the video as that seems to have been lost over the years.  This might be helpful for a few of the more obscure features you may not have realised were possible.

Read More

One of my favourite features in Studio 2017 is the filetype preview.  The time it can save when you are creating custom filetypes comes from the fun in using it.  I can fill out all the rules and switch between the preview and the rules editor without having to continually close the options, open the file, see if it worked and then close the file and go back to the options again… then repeat from the start… again… and again…   I guess it’s the little things that keep us happy!

I decided to look at this using a YAML file as this seems to be coming up quite a bit recently.  YAML, pronounced to rhyme with “camel”, stands for “YAML Ain’t Markup Language” and I believe it’s a superset of the JSON format, but with the goal of making it more human readable.  The specification for YAML is here, YAML Specification, and to do a really thorough job I guess I could try and follow the rules set out.  But in practice I’ve found that creating a simple Regular Expression Delimited Text filetype based on the sample files I’ve seen has been the key to handling this format.  Looking ahead I think it would be useful to see a filetype created either as a plugin through the SDL AppStore, or within the core product just to make it easier for users not comfortable with creating their own filetypes.  But I digress…
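As a rough illustration of the regular expression idea, here is a minimal Python sketch that pulls the text after each key out of a few simple YAML lines.  The sample content, the pattern and the function name extract are my own assumptions for illustration and are not the rules from any Studio filetype; the pattern only copes with single-line values and would need more work for multi-line or nested content.

```python
import re

# Hypothetical sample; real YAML files can be far more varied than this.
SAMPLE = '''\
# greeting strings
title: "Welcome back"
button:
  ok: 'Continue'
  cancel: Cancel
'''

# A key, a colon and whitespace, then the value with optional matching quotes.
LINE = re.compile(
    r'''^(?P<indent>\s*)(?P<key>[\w.-]+):\s+(?P<quote>["']?)(?P<text>.+?)(?P=quote)\s*$'''
)


def extract(yaml_text):
    """Return (key, text) pairs for lines that carry a single-line value."""
    segments = []
    for line in yaml_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        match = LINE.match(line)
        if match:
            segments.append((match.group("key"), match.group("text")))
    return segments


for key, text in extract(SAMPLE):
    print(key, "->", text)
```

Lines such as "button:" that only open a nested block have nothing after the colon, so they are skipped, which is the same sort of decision you end up making when you define the opening and closing patterns for a delimited text filetype.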

Read More

Ever since Trados came about one of the most requested features for translators has been merging across hard returns, or paragraph breaks.  Certainly for handling the translation it makes a lot of sense to be able to merge fragments of a sentence that should clearly be in one, but despite this it’s never been possible.  Why is this?  You can be sure this question has come up every year and whilst everyone agrees it would be great to have this capability, Trados has not supported it through the product.  The reason for the reluctance is that when you merge a paragraph unit (the name given to translation units separated by a paragraph break) you probably need to be able to decide how this change to the structure of the file should be handled in the target document.  Sometimes this might be simple, other times it might not be, and the framework that Trados products use is not designed in a way that supports the ability to alter the look and feel of the target file across every filetype the product can open.  Even the Studio suite of products still uses the same basic idea of being able to handle the bilingual files directly rather than importing them into a black box and whilst this does offer many advantages, this problem of merging over paragraph units remains… until now.

Read More

Wow… how time flies!  Over three years ago I wrote an article called AutoCorrect… for everything! which explained how to use AutoHotkey so you had a similar functionality to Microsoft Word for autocorrect, except it worked in all your Windows applications.  This was, and still is, pretty cool I think and I still use AutoHotkey today for many things, and not just autocorrect.  Since writing that article we released Studio 2015, and in fact Studio 2017 is just around the corner, so it was a while back and some things have moved on.  For example, Studio 2015 introduced an autocorrect feature into Studio which meant things should be easier for all Studio users, especially if they had not come across AutoHotkey before.

Read More
