Updated 15 January 2015: Only 10,000 TUs are required to generate an AutoSuggest dictionary with Studio 2014.
I’ve been talking over the last few weeks to a freelance translator in Canada who purchased Studio 2011. She has a great set of resources from many years of translating, all split into different sublanguages to cater for the en(US), en(GB), fr(FR) and fr(CA) variations. What she didn’t have was consolidated Translation Memories to maximise her leverage across all of these variations, nor AutoSuggest dictionaries or termbases, and she didn’t use the AutoText lists.
The AutoText is more of a personal preference I think, but the Translation Memories could be put to good use, a Termbase is always a good idea (follow @jaynefox to see why as she is writing some blogs on this subject) and because her language pairs are flavours of common languages there are also some good free resources you can leverage in case you don’t have sufficient Translation Units yet to make the required 25,000 (10,000 if you have Studio 2014) to generate AutoSuggest Dictionaries.
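If you are not sure whether a TMX export clears that threshold, a quick count of the translation units will tell you. This is only a minimal sketch: it assumes each unit opens with a `<tu>` tag as in standard TMX, the thresholds are simply the figures quoted in this post, and the names `count_units` and `enough_units` are made up for illustration.

```python
import re

# Minimum TU counts as quoted in this post (not read from Studio itself).
MINIMUM_TUS = {"Studio 2011": 25_000, "Studio 2014": 10_000}

# Matches an opening <tu> tag but not <tuv>, which shares the same prefix.
_TU_OPEN = re.compile(r"<tu[\s>]")


def count_units(lines):
    """Count opening <tu> tags in an iterable of text lines."""
    return sum(len(_TU_OPEN.findall(line)) for line in lines)


def enough_units(count, version="Studio 2014"):
    """True if the count reaches the minimum quoted for that version."""
    return count >= MINIMUM_TUS[version]
```

Because `count_units` accepts any iterable of lines, you can pass it an open file object and the TMX is read a line at a time rather than loaded whole.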
In this case our Translator had plenty of TUs, nearly a million when they were all added together but I decided to have a play with the free resources available via the DGT Multilingual Translation Memory of the Acquis Communautaire. This is a collection of Translation Memories that represent the entire body of European Legislation across 22 languages…. in fact it is rapidly approaching a combined collection of some 40 million Translation Units.
So, after downloading the program and dll provided by the DGT, then all 25 volumes of data, I set about extracting the language pair I wanted, English to French (the screenshot below is the program downloaded from the DGT):
I actually used all 25 volumes for this and after a few minutes… well a little more than a few.. I had a TMX file containing some 1.9 million Translation Units. Easily over the 25 thousand I needed for an AutoSuggest Dictionary. Now I actually wanted en(US) to fr(CA) for this exercise. I know there will be regional differences, but as an additional resource to help get more from Studio I think using this for an AutoSuggest Dictionary and even as a reference Translation Memory will be helpful. So I opened the TMX in an editor and replaced all the references to FR-FR with fr-CA and all the EN-GB with en-US… like this:
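The same search and replace can be scripted, which is handy when the file is too big for your editor. The sketch below is one possible approach, not what I used: it only rewrites codes that sit inside `xml:lang`, `lang` or `srclang` attributes, so any text inside the segments is left alone. The function names and the `encoding` default are assumptions; check the encoding declared at the top of your own TMX (many are UTF-16) and pass it in.

```python
import re

# Mapping used in this exercise: code found in the TMX -> code wanted.
LANGUAGE_MAP = {"EN-GB": "en-US", "FR-FR": "fr-CA"}

# Matches xml:lang="..", lang=".." and srclang=".." with either quote style.
_LANG_ATTR = re.compile(r'''\b((?:xml:)?lang|srclang)\s*=\s*(["'])([^"']+)\2''')


def retag_line(line, mapping=LANGUAGE_MAP):
    """Replace language codes inside language attributes only."""
    lookup = {old.lower(): new for old, new in mapping.items()}

    def swap(match):
        attr, quote, code = match.groups()
        # Codes are compared case-insensitively; unknown codes are kept.
        return f"{attr}={quote}{lookup.get(code.lower(), code)}{quote}"

    return _LANG_ATTR.sub(swap, line)


def retag_file(src_path, dst_path, encoding="utf-8", mapping=LANGUAGE_MAP):
    """Stream the TMX line by line so it never has to fit in memory."""
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, "w", encoding=encoding) as dst:
        for line in src:
            dst.write(retag_line(line, mapping))
```

Writing to a new file rather than editing in place keeps the original download intact in case the mapping turns out to be wrong.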
The next thing I did, because I really have no idea what is in this TMX, was use Studio to clean it up a bit. So I created a new Translation Memory in Studio and then imported this TMX into it. This process gets rid of duplicates, mostly because Studio uses recognizers for numbers, date/time expressions, measurements etc. and so does not need to store as many units as Workbench, or many other tools, would; the import can also remove some invalid units where parts of the XML are missing or invalid. So really no surprise that after importing the 1.9 million I was left with a Translation Memory that only contained 1.2 million.
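To get a feel for why so many units disappear, here is a rough illustration of the idea. It is emphatically not Studio's import logic, just a sketch under the assumption that two units count as duplicates when they differ only in their numbers or whitespace; the `<num>` placeholder and both function names are invented for the example.

```python
import re

# A run of digits, optionally with . or , separators (e.g. 12, 1,000.50).
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def normalise(segment):
    """Collapse whitespace and replace every number with a placeholder."""
    return _NUMBER.sub("<num>", " ".join(segment.split()))


def dedupe(units):
    """Keep the first (source, target) pair for each normalised key."""
    seen = set()
    kept = []
    for source, target in units:
        key = (normalise(source), normalise(target))
        if key not in seen:
            seen.add(key)
            kept.append((source, target))
    return kept
```

With legislation full of numbered articles, dates and amounts, it is easy to see how a rule like this would fold a large share of the units together.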
So, the next step was to convert all the Translation Memories that the translator provided into TMX as well and then merge them all together. I could do this using the upgrade route in Studio and not have to bother with TMX at all, but I actually want the TMX this time. This is because I want to join the translator’s TMX to the DGT TMX and see whether I can create a performant AutoSuggest… and it’s easier to manually add the TMX files together, or rather faster to copy and paste the Translation Units, than to use the merge process and export. So to make the export of the translator’s memories to TMX easy I used the SDL Translation Management Utility from the OpenExchange. This allowed me to handle all eight translation memories our translator provided in one go and quickly convert them to TMX. Then I saved the DGT TMX with a new name and copied all the TUs from the translator into it… so I took the units between these lines:
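The copy and paste can be scripted too. A minimal sketch, assuming the units sit between a plain `<body>` and `</body>` pair as they do in standard TMX, and that both files use the same language codes; the function names are hypothetical and the header of the first file is kept as it is.

```python
def body_units(tmx_text):
    """Return everything between <body> and </body>."""
    start = tmx_text.index("<body>") + len("<body>")
    end = tmx_text.rindex("</body>")
    return tmx_text[start:end]


def append_units(base_tmx, extra_tmx):
    """Insert the units of extra_tmx just before </body> in base_tmx."""
    cut = base_tmx.rindex("</body>")
    return base_tmx[:cut] + body_units(extra_tmx) + base_tmx[cut:]
```

This works on whole strings, so it suits files that fit comfortably in memory; for something the size of the DGT TMX a line-by-line variant would be the safer choice.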
I now have three TMX files and one SDLTM (the Studio Translation Memory):
1. TMX – the translator’s consolidated TMs
2. TMX – the “cleaned” DGT TM
3. TMX – a combined TMX containing 1. and 2.
4. SDLTM – a Studio TM based on the DGT for reference when translating
The next step is to create the AutoSuggest Dictionaries. I decided to create two, one based on 1. and one based on 3. This was just in case a large AutoSuggest performed slowly when translating. The small one took 20 minutes or so, but the large one took considerably longer… several hours in fact. But it was still successful… I’m showing the highlights here to give you an idea of what to expect when you create a large AutoSuggest and it looks as though it’s not doing anything for a while (it’ll get there in the end..!)
First of all you get to decide how much of your TM will be processed… so if you find it’s too large for your computer to handle you can reduce the work it has to do. I played with the entire TM just to see if it would work as this is the largest TM I have ever used to create an AutoSuggest Dictionary:
My laptop then ran for several hours whilst I was working on other things (good reason to have plenty of RAM..!) and went through these stages:
Several hours later it was done. I ran a quick test on some likely material from Wikipedia – http://en.wikipedia.org/wiki/Canadian_Charter_of_Rights_and_Freedoms – using only this large AutoSuggest Dictionary:
Looks pretty good… I type one letter and get the translation for the entire first segment at the top of the list… it was also instantaneous so an encouraging start for such a large Dictionary and I hope a worthwhile exercise for our Translator too..!
Next steps: termbases and AutoText lists…