When the developer of the Word Cloud plugin for SDL Trados Studio first showed me the application he had developed, I was pretty impressed… mainly because it just looked so cool, but also because I could think of a couple of useful applications for it:
- You could see at a glance what the content of the project was and how interesting it might be for you
- It looks cool… or did I say that already?
Actually, if I’m honest, I never got any further than thinking about those two things, so this was a kind of “head in the clouds” app: an interesting experiment that seemed like a good idea, although we weren’t 100% sure why. But the plugin has one more interesting feature: you can click on a word and it tells you how many times that word occurs. This matters because if you had a termbase containing all these terms before you started, you would get consistent translations as well as AutoSuggest capability, which could be quite useful. So it’s a sort of term extraction tool that needs no real work from you at all. Well, it would be a term extraction tool if you could get the terms out!
So I looked around to see where the file was that held this information after you created the word cloud and interestingly enough it’s saved in the same folder as the project… and even more interestingly it’s a nice simple XML file! In fact it looks like this:
    <?xml version="1.0" encoding="utf-8"?>
    <wordcloud>
      <hash>1</hash>
      <words>
        <word text="Advisor" count="43" />
        <word text="Company" count="30" />
        <word text="Construction" count="26" />
        <word text="flowers" count="1" />
      </words>
    </wordcloud>
The actual file is much bigger than this, as you’d expect, but the format is repeated all the way through. Each word element carries the word itself in its text attribute and the number of occurrences in its count attribute. This is perfect… I guess you can see where I’m going with this now? I can create a simple XML filetype for Studio that can do two things:
- Extract all the words for translation
- Only extract those whose count is above a certain value
I added the second point because you might not be interested in all the words that are not repeated… you might be, but you might not. So if I create the possibility to set this value in the filetype you can make your own mind up and the filetype becomes very useful. So, what two rules do I need for this, and do I even need two?
The first one to extract the words from the text attribute is simple enough:
So this just uses XPath to extract the words from the text attribute. To set the count I can add this into the same expression like this:
So this just means only take the word elements that have a count value greater than 5 (you can change this to whatever you like… 0 if you want everything, or omit the count part from the rule), and then just take the contents of the text attribute. Simple, and now I have this as my filetype parser rules. I added the //* out of habit to ensure nothing else is parsed… you don’t really need it at all in this case:
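If you’d like to check outside Studio which words a given threshold would pull in, here is a minimal Python sketch of the same selection. It is not part of the plugin or the filetype: the function name extract_terms and the min_count parameter are hypothetical names used only for illustration. It filters in plain Python because the standard library’s XPath subset has no numeric comparisons; in full XPath 1.0 the same idea could be written as something like //word[@count > 5]/@text, which is my assumption about the general shape of such a rule rather than a copy of it.

```python
import xml.etree.ElementTree as ET

# Sample content in the same shape as the word cloud file shown earlier.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<wordcloud>
  <hash>1</hash>
  <words>
    <word text="Advisor" count="43" />
    <word text="Company" count="30" />
    <word text="Construction" count="26" />
    <word text="flowers" count="1" />
  </words>
</wordcloud>"""


def extract_terms(xml_text, min_count=5):
    """Return the text attribute of every <word> whose count is greater than min_count.

    extract_terms and min_count are illustrative names, not part of any Studio API.
    """
    root = ET.fromstring(xml_text.encode("utf-8"))
    terms = []
    for word in root.iter("word"):
        # Attribute values are strings in XML, so convert before comparing.
        if int(word.get("count", "0")) > min_count:
            terms.append(word.get("text"))
    return terms


print(extract_terms(SAMPLE))     # words occurring more than 5 times
print(extract_terms(SAMPLE, 0))  # threshold 0 keeps everything
```

With the sample above, the default threshold drops “flowers”, which occurs only once, while a threshold of 0 keeps all four words.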
So now I translate the file in Studio. When I’ve done this, keeping in mind the end goal here is a termbase, I need to convert the SDLXLIFF to a TMX (unless the developer of the Glossary Converter adds SDLXLIFF to the convertible file formats ;-)) because from there I can easily create the termbase. Conveniently there is an app on the OpenExchange called SDLXliff2Tmx which will allow me to convert an SDLXLIFF to TMX with a drag and drop.
So the process is OpenExchange all the way… with a little translation along the way.
Wordcloud -> XML -> SDLXLIFF -> TMX -> SDLTB
Now if all that sounds complicated it’s not… here’s a short video to explain the process:
So the Wordcloud plugin has a surprising benefit after all… it’s also a free term extraction tool that takes no effort at all and allows you to create a Project termbase before you start your work. Very cool! One last thing… if you want more information on how to use XPath, or how to create custom XML filetypes, you can find a couple of articles here which might be useful: