Unfortunately the practice of being asked to translate a Microsoft Word file that contains HTML code doesn’t look as though it will go away any time soon for some translators. But it’s not the end of the world and it’s often all in the preparation of the Word file before you translate it.
This article is just a short-ish one I decided to write after seeing this come up again on ProZ last week, and because it’s another place where all those lovely regular expressions we’re learning about can come in handy. Yes, Microsoft Word also supports regular expressions, although in its own flavour. You can read more about this by just googling for “regular expressions in Microsoft Word” and you will find plenty of help on the subject. In Word they are called wildcards, but they share many of the same principles, as we’ll see with this very simple example.
I have a Word file that looks like this and you can see I have added what’s often referred to as embedded HTML copied in as text:
If I open this in Studio I get this, which is not too easy to work with. Hardly surprising though, as this is a terrible way to handle content like this… actually if anyone can tell me why people do it I’d be interested to learn:
So the solution for Studio users is to do one of two things:
- Copy the html into a decent text editor, save as html, and then use Studio to handle the html separately, or
- Use a little regex magic to mark all the tags as hidden text so they can’t be seen in Studio
For this article I’m going to use the latter and search for the tags, replacing them with the same text set to the hidden font property in Word. Sometimes this is an easier approach for files with embedded content like this because the HTML may be scattered all over the place, so this is one operation rather than many. To do this I’ll use the following expression to find the tags: \<*\>
So this is very similar to the .NET flavour of regex that Studio uses, but the symbols have slightly different meanings. Word uses the angle brackets to mark the start and end of a word so that you can find single words only… sort of like word boundary markers in .NET. I actually want to find the angle brackets themselves, so I have to escape them, and this is what the backslash does. The star symbol is not quite the same as in .NET: in Word it means find any string of characters all on its own, whereas in .NET the star only repeats whatever comes before it, so you would write something like .*? to get a similar effect. So in my Word find and replace dialogue I set it up like this:
- I enter my regular expression in the “Find what” box
- I check the “Use Wildcards” checkbox
- With the cursor in the empty “Replace with” box I click on “Format”, then “Font” and in there click on “Hidden”
You can see just beneath the search pattern and beneath the empty replace box it tells me what settings I used for each. Now all I do is click on “Replace All”. Immediately all my tags have disappeared and the Word file looks like this:
But don’t worry… if I click the display formatting button (the one shown here on the right) it all comes back again. The tags will now have dotted lines under them, but this just tells you that they have the hidden font property, so I can simply set the option in Studio not to extract hidden text for translation. You can find this option under the “Common” node in the filetype settings for Microsoft Word:
Now when I open the file for translation I see this:
Much easier to handle: all the HTML code is hidden and I can safely translate the file.
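If you are more at home with ordinary regex than with Word wildcards, here is a rough sketch in Python of what the search is doing. It is only an approximation under my own assumptions: the pattern <[^<>]*> is my stand-in for Word’s escaped angle brackets with a star in between, and the sample string and the split_tags helper are made up purely for illustration. It doesn’t touch a Word file at all; it just shows which parts of a line would end up hidden and which would be left for translation.

```python
import re

# Approximation (an assumption, not Word's own engine) of the wildcard
# search: a literal "<", any run of characters that are not angle
# brackets, then a literal ">".
TAG = re.compile(r"<[^<>]*>")


def split_tags(text):
    """Hypothetical helper: return (tags, visible_text) for a string
    containing embedded HTML tags."""
    tags = TAG.findall(text)       # what would be marked as hidden
    visible = TAG.sub("", text)    # what would remain for translation
    return tags, visible


# Made-up sample line of embedded HTML.
sample = '<p>Click <a href="help.html">here</a> for <b>help</b>.</p>'
tags, visible = split_tags(sample)

print(tags)
print(visible)
```

Running something like this over a few sample lines before doing the “Replace All” in Word can be a handy sanity check that the pattern only catches the tags and none of the translatable text.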
In reality this is an exercise in seeing yet another application for regular expressions in other software tools… this time Microsoft Office… because I truly hope you don’t see any files like this at all. But if you do, as I occasionally do, then perhaps this article will help you navigate the content of the file safely without destroying the tags.
Once you are done, select all the text in the target file, right click and select “Font”, then clear the “Hidden” checkbox to unhide the hidden text. Simple!