A little Learning is a dang’rous Thing;

Drink deep, or taste not the Pierian Spring:
There shallow Draughts intoxicate the Brain,
And drinking largely sobers us again.

I’m quoting Alexander Pope in 1709, rightly or wrongly, for hitting the nail on the head when it comes to the truly intoxicating mix of language and technology. A little knowledge is indeed a dangerous thing and it’s something I know I’ve been guilty of all my life… I learn a little something new and now I’m an expert. That is of course until I learn a bit more, and then a little more after that, and before I know it I realise I know nothing at all! Translation technology is great for dropping us all into this trap… Trados user since Trados 5, translator for over 20 years… can handle any type of file. Falling into this trap is pretty easy in fact, especially when the tools available for translation today take a lot of the effort out of the tasks at hand. But not everything is what it seems and sometimes it takes a mistake or three to sober us up again! There’s a reason why well organised and successful translation companies, dealing in all kinds of content, have Project Managers, Translators and Localization Engineers in their midst.

To explain this better I’m going to tell you a little story of how I spent a couple of evenings during an SDL Roadshow trip to Warsaw and then Budapest this week working on a process-related problem with an agency in Canada. It was actually quite an interesting time and let me play around with a few things I like to pretend I know about. Be warned now… it’s quite a long post, so if you’d rather just watch the movie, scroll to the bottom!

The Problem

The problem started with the client providing an XML file for translation to the Project Manager, who then passed it straight over to the translator after asking if they were OK working with XML. Studio, like most CAT tools, can of course handle an XML file, but as we know this is just the start of the story with XML. The translator opened the XML, translated all the English text into German and then sent the translated XML file back to the Project Manager and on to the client. The client wasn’t happy! Why?

Let’s start with the original XML, which looked like this:

[Image 02: the original XML file]

If you know about XML you are probably thinking, OK, it looks to me as though the German translation should be placed into the target element. This would be a reasonable assumption, and if you knew this you’d also know that translating this XML with the simple default XML filetype would probably leave you with something looking like this:

[Image 03: the XML after translating with the default XML filetype]

Clearly wrong because the German translation has just overwritten the English source!  In order for this to work when using a monolingual XML filetype you must ensure the text to be translated is in the target element (in this example) before you start so you can correctly replace this.  If the file isn’t partially translated already then this is fairly straightforward using a regex search and replace.

So search for this (I’ll make two very important points here. First, this expression was created for this particular file… if you have a file you wish to do this with it may not have the same layout and will most likely require a different expression. Second, always back up your file before you do it!):

^(\s+<source xml:lang=".*?">)(.*?)(</source>\s+<target xml:lang=".*?">)(</target>)

And replace with this:

$1$2$3$2$4

I used Notepad++ for this, making sure that the “. matches newline” option was checked in addition to “use regular expressions” because the file contained a mix of Windows-style and Unix-style line endings (I’m grateful for an explanation from Jan Goyvaerts on this problem because I was messing around with \n\r and \s in the expression to make it work originally… I did manage it but the expression was messier). All we’re doing is looking for all the source and target elements and capturing them into four back references. Then we replace them in the same order, but add the second back reference (containing the source text) twice so it’s also placed into the target element. This takes seconds and now I have this:

[Image 04: the XML with the source text copied into the target elements]
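For anyone who would rather script this than use Notepad++, the same search and replace can be sketched in Python. This is only an illustration: the sample XML is a made-up stand-in for the real file (the element layout, ids and text are assumptions), and `copy_source_to_target` is a hypothetical helper name. Python writes the back references as `\g<1>` rather than `$1`, `re.DOTALL` plays the role of the “. matches newline” option, and `re.MULTILINE` lets `^` match at the start of every line.

```python
import re

# Same expression as above: four groups covering the source element,
# its text, the gap up to the empty target, and the closing target tag.
COPY_SOURCE = re.compile(
    r'^(\s+<source xml:lang=".*?">)(.*?)'
    r'(</source>\s+<target xml:lang=".*?">)(</target>)',
    re.MULTILINE | re.DOTALL,
)


def copy_source_to_target(xml_text: str) -> str:
    """Repeat group 2 (the source text) inside the empty target element."""
    return COPY_SOURCE.sub(r"\g<1>\g<2>\g<3>\g<2>\g<4>", xml_text)


# Hypothetical sample with mixed line endings (\r\n and \n).
SAMPLE = (
    '<trans-unit id="1">\r\n'
    '  <source xml:lang="en">Save changes</source>\r\n'
    '  <target xml:lang="de"></target>\r\n'
    '</trans-unit>\n'
    '<trans-unit id="2">\n'
    '  <source xml:lang="en">Cancel</source>\n'
    '  <target xml:lang="de"></target>\n'
    '</trans-unit>\n'
)

if __name__ == "__main__":
    print(copy_source_to_target(SAMPLE))
```

As with the Notepad++ version, this assumes every target element is empty. A populated target can let the lazy groups run on into a later unit, so check for those first and work on a backup copy.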

If the translator had received the XML file prepared like this it would have been more obvious what was required, and this leads me back a step, because when I open this in Studio I see this:

[Image 05: the prepared XML opened in Studio, with the source text shown twice]

Note how the text is duplicated and we see the source text twice. This is because the AnyXML filetype, which is the default XML filetype for handling monolingual XML files, will extract the text from ALL the elements it finds. This is the correct action for the AnyXML filetype, but it’s not what we want to do here. Unfortunately the translator, who is probably an excellent linguist and probably knows very well how to work with the translation tool, doesn’t know enough about what’s happening under the hood or why it’s important to handle XML with particular care. So a case of knowing just enough to get into a bit of trouble, but not enough to be able to recognise that a scenario like this needs handling in a different way.
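One way to picture the doubling is with a few lines of Python. This is not how Studio parses files; it is a plain illustration using the standard library, with a made-up sample file (element names other than source and target are assumptions) and hypothetical function names. It compares extracting text from every element with extracting text from the target elements only.

```python
import xml.etree.ElementTree as ET

# Hypothetical prepared file: the source text has been copied into target.
PREPARED_XML = """<file>
  <body>
    <trans-unit id="1">
      <source xml:lang="en">Save changes</source>
      <target xml:lang="de">Save changes</target>
    </trans-unit>
    <trans-unit id="2">
      <source xml:lang="en">Cancel</source>
      <target xml:lang="de">Cancel</target>
    </trans-unit>
  </body>
</file>"""


def extract_all_text(xml_text):
    """Collect text from every element, as a catch-all parser would."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter() if el.text and el.text.strip()]


def extract_target_only(xml_text):
    """Collect text from target elements only."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter("target")
            if el.text and el.text.strip()]


def word_count(segments):
    """Naive whitespace word count, enough to show the doubling."""
    return sum(len(segment.split()) for segment in segments)


if __name__ == "__main__":
    print(word_count(extract_all_text(PREPARED_XML)))     # every element
    print(word_count(extract_target_only(PREPARED_XML)))  # target only
```

The catch-all extraction sees each sentence twice, once in source and once in target, which is where the inflated wordcount comes from.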

Back to our story though. The translator sent the translated SDLXLIFF and the translated XML to the Project Manager, with neither of them knowing what was wrong. The client of course sees the problem immediately and asks the Project Manager to resolve it.  The translator believes the job is done because the file was translated and the Project Manager just knows he has a translator who wants to be paid, and a client who is unhappy and may not be prepared to push any more work his way!

The Resolution

So the start of this story, for me, was receiving the translated SDLXLIFF and the translated XML. How do we resolve this given we also don’t have the original XML (it wasn’t included and there was no time to explain and then wait for it to be provided)?

Not too tricky in fact… but here are the first steps:

  1. Open the translated SDLXLIFF in Studio.
  2. Use the Advanced Save option in the File menu to save the source file as opposed to the target file… a neat option in Studio.
  3. Open the XML in Notepad++ and make sure there are no target elements already containing translated text.  I used this simple regex to look for any that were populated:
    <target xml:lang=".+?">.+</target>
  4. Once happy there were none I could apply the regex search and replace above and populate the XML target elements with the source text.
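The check in step 3 can also be sketched in Python, for example to run it over many files. This is a rough illustration only: `populated_targets` is a hypothetical helper, and it assumes each target element sits on a single line, as the regex above does when “. matches newline” is off.

```python
import re

# Same expression as in step 3, with straight quotes.
POPULATED_TARGET = re.compile(r'<target xml:lang=".+?">.+</target>')


def populated_targets(xml_text):
    """Return (line number, line) pairs for targets that already hold text."""
    hits = []
    for number, line in enumerate(xml_text.splitlines(), start=1):
        if POPULATED_TARGET.search(line):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    sample = (
        '<source xml:lang="en">Cancel</source>\n'
        '<target xml:lang="de"></target>\n'
        '<target xml:lang="de">Abbrechen</target>\n'
    )
    for number, line in populated_targets(sample):
        print(number, line)
```

An empty list means it is safe to go on to step 4; anything else needs looking at by hand before the source text is copied across.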

All simple stuff so far. But now I must make sure the text in both the source and the target elements isn’t extracted for translation, which would double the wordcount as I showed before and of course overwrite both the source and target elements with the translation. To do this I just need to create a custom filetype. I won’t go into detail here as I’ve discussed this many times before, so if you’re interested refer to this article, although I will show the overall process in a video at the end. But it’s pretty simple for this file, two rules:

[Image 06: the two rules in the custom filetype]

This will ensure that I only extract the text in the target element for translation. So what I get in Studio is exactly what the translator saw, but the difference being I’m now going to translate the text in the target element and not the text in the source. I also added some context because after doing this for real the project manager noted that there was also embedded code (%s) in the midst of the thousands of lines and the translator sometimes mistakenly handled it as translatable text, or missed it out of the translation altogether, so the translation contains this sort of thing:

[Image 07: examples of mishandled %s code in the translation]

There was also stuff like this in the file that is easily mishandled, and in this case probably translated when it should not have been changed at all, so I created tags for these too which ensure they are excluded from the translation altogether:

[Image 08: examples of other code in the file that should not be translated]

There were a couple of others, and some that just look plain wrong in the source file, but I ensured they remained as provided, although given more time and the relationship with the end client I think it would make sense to question some of it in case it was a mistake they need to correct. So I had these rules:

[Image 09: the rules used in the custom filetype]

Now when I open the file in Studio it looks better:

[Image 10: the file opened in Studio with the custom filetype]

It’s also safer because not only will the translator see these are tags, but Studio carries out an automated check to see if they are missing.  These things don’t happen if you treat the content as text.  The other tags with the curly brackets were all external so these were moved safely out of the translation altogether ensuring a mistake cannot be made.
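To show why treating this code as tags is safer than treating it as text, here is a minimal Python sketch of the kind of check that becomes possible. It is not Studio’s verification, just an illustration under assumptions: the token shapes (`%s`, `%d` and anything in curly brackets) and the name `missing_tokens` are mine, and the sample sentences are invented.

```python
import re
from collections import Counter

# Assumed token shapes: printf-style %s / %d and {curly bracket} tokens.
TOKEN = re.compile(r"%[sd]|\{[^{}]*\}")


def missing_tokens(source, target):
    """Return tokens found in source that are absent from target.

    Counts matter: if the source has two %s and the target has one,
    one %s is reported as missing.
    """
    needed = Counter(TOKEN.findall(source))
    present = Counter(TOKEN.findall(target))
    return sorted((needed - present).elements())


if __name__ == "__main__":
    print(missing_tokens("Found %s files in {folder}",
                         "%s Dateien in {folder} gefunden"))
    print(missing_tokens("Found %s files in {folder}",
                         "Dateien gefunden"))
```

Once the code is recognised as something other than words, dropped or altered tokens can be flagged segment by segment instead of being found later by the client.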

Unfortunately, due to the urgency of the request from the Agency, and my lack of availability late at night in between roadshows, I didn’t inspect the text in detail as I have done now. I just recovered the situation so the Agency was able to generate XML target files that contain source and target translations as the client intended, based on the content of the original translated SDLXLIFF I received. So they have had to carry out a manual check to ensure all the code that should not have been touched was written into the target XML correctly.

But this still leaves an interesting piece of the puzzle.  How to recover the situation now that I have the correct XML file, and a customised XML filetype to work from?  To do this I first created a Translation Memory from the SDLXLIFF provided and just pretranslated the new SDLXLIFF I created.  But unfortunately this wasn’t good enough because the original contained duplicate source segments with multiple translations and these all incurred a penalty leaving me with two, or sometimes more choices over which one was the right one to select.  Like this for example:

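[Image 11: example of multiple translation choices for one source segment, with the differences highlighted]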

I highlighted the differences in yellow, all things the out of the box QA check could have found in Studio, but also all things preventing me from making sure my file was the same as the one the translator provided but able to deliver the correctly formed XML target file.  So I adopted a different approach, and used Perfect Match to get the status of my SDLXLIFF matching the original linguistically, and then my task was complete.  This worked perfectly, and the file was matched so I had exactly the same translations segment by segment allowing me to save the target and return an XML file like this:

[Image 12: the resulting XML target file]

Now, in this example I’m showing here, where I additionally created tags to protect code in the file, the Perfect Match operation left me with fuzzy matches and not Perfect Matches. This is because the source has changed… I’ve introduced tags. But this is easily solved manually by creating a Translation Memory from the original SDLXLIFF and using fuzzy matching to complete the task with the tags correctly placed.
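The difference between the two recovery approaches can be sketched in a few lines of Python. This is a deliberate simplification and not how Studio’s Translation Memory or Perfect Match work internally: the function names and sample segments are hypothetical, and “same position, same source” merely stands in for matching against the previous bilingual file.

```python
from collections import defaultdict

# Hypothetical segments from the original translated file, in file order.
ORIGINAL = [
    ("Cancel", "Abbrechen"),
    ("Found %s files", "%s Dateien gefunden"),
    ("Cancel", "Stornieren"),
]


def build_tm(pairs):
    """Map each source to every distinct translation seen for it."""
    tm = defaultdict(list)
    for source, target in pairs:
        if target not in tm[source]:
            tm[source].append(target)
    return dict(tm)


def ambiguous_sources(tm):
    """Sources with more than one translation: a lookup cannot pick one."""
    return {src: tgts for src, tgts in tm.items() if len(tgts) > 1}


def recover_by_position(original_pairs, new_sources):
    """Reuse the old translation only where position and source both agree."""
    recovered = []
    for (old_source, old_target), new_source in zip(original_pairs, new_sources):
        recovered.append(old_target if old_source == new_source else None)
    return recovered


if __name__ == "__main__":
    print(ambiguous_sources(build_tm(ORIGINAL)))
    print(recover_by_position(ORIGINAL, [s for s, _ in ORIGINAL]))
```

A lookup by source text alone cannot tell the two “Cancel” segments apart, whereas walking the two files in step recovers each translation where it was originally used. If the new source differs, for example because tags were introduced, the sketch returns `None` for that segment, which mirrors the case above where a fuzzy match has to finish the job.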

All in all I was quite pleased with the features in Studio that made it possible for me to do this, and I kind of enjoyed the challenge and the learning experience… I also thought it would make a nice case study to share as it contains lots of useful lessons that I hope benefit others too.

The Moral

Is there one? Well apart from the improved understanding I have gained for Project Managers and Translators who find themselves in a position like this, and the obvious stress this creates between all parties, including the client, there is a moral to this, or probably several. The first is that it’s not always enough to be an experienced translator or project manager. Today the filetypes you could be asked to handle require a better understanding of the scope of the work than just how many words you think it is. Even wordcount differences in less challenging filetypes can cause disagreement and confusion, but with XML you have to remember that these filetypes can contain translatable text in elements and/or attributes and they can contain conditional translation where you only translate the text under certain circumstances. If you assume you know enough about translating in any CAT tool without getting the answers to what the scope of work is before you start then you could be heading down a long and painful path leading to excessive work without payment and/or unhappy clients.

When handling XML files always ensure the following:

  1. That the client has prepared the XML files for translation and given you clear instructions on what needs to be translated, or
  2. That you (Project Manager or Translator) understand the requirements and what must be translated before you start

If it’s the latter then my advice would be to either employ a localization engineer with the appropriate skills to prepare files for the translators if you do not know how to do this yourself, or only give work of this nature to translators with proven experience in this field.  If you aren’t convinced then here’s a little light reading on this sort of topic to explain why it’s different to handling Word files!

These are not exhaustive, but I hope they shed some light on the sort of technical skills you need somewhere in the mix of client, project manager and translator.

One last thing I’ll mention… these particular files could also be resolved another, possibly easier way and it’s a final moral to this tale. Always make sure the person you ask to do these things is experienced enough to see the most appropriate way to handle the files from the start! If you notice when you watch the video, these files actually have an XLIFF body embedded into the XML. So if you remove the XML declaration at the start of the files and then add an XLIFF extension (or add XML to the XLIFF filetype in Studio), you can open them and save them using the XLIFF filetype as true bilingual files. The result looks something like this:

[Image 13: the file saved using the XLIFF filetype]

You’ll see you do get an XLIFF target state added to the files, but you could remove these afterwards if they were a problem and put the XML declaration back if needed.  You also wouldn’t be able to handle the embedded content in the way I showed it in this article, but still a simpler solution.  So whilst it was great fun to play around this way, you can see that asking me, and I’m not a localization engineer, can often lead to enjoyable but tricky workarounds when the most appropriate solution is almost under your nose!!  A little learning is indeed a dangerous thing!!
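The file preparation for this XLIFF route can be sketched in Python too. Treat it as a rough illustration under assumptions: the helper names are hypothetical, `.xlf` is just one extension commonly used for XLIFF, and the files are assumed to be UTF-8. Keep the original file so the declaration can be put back afterwards if needed.

```python
import re
from pathlib import Path

# Matches an optional byte order mark plus the XML declaration at the very start.
XML_DECLARATION = re.compile(r"\A\ufeff?\s*<\?xml\s[^>]*\?>\s*")


def strip_xml_declaration(text):
    """Remove a leading XML declaration if there is one."""
    return XML_DECLARATION.sub("", text, count=1)


def xliff_path(xml_path):
    """Same location and name, with an XLIFF extension instead."""
    return Path(xml_path).with_suffix(".xlf")


def save_as_xliff(xml_path):
    """Write a copy without the declaration next to the original file."""
    source = Path(xml_path)
    target = xliff_path(source)
    text = source.read_text(encoding="utf-8")  # assumption: UTF-8 files
    target.write_text(strip_xml_declaration(text), encoding="utf-8")
    return target


if __name__ == "__main__":
    sample = '<?xml version="1.0" encoding="UTF-8"?>\n<xliff version="1.2"></xliff>'
    print(strip_xml_declaration(sample))
    print(xliff_path("strings.xml"))
```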

XML… the Movie

Length approx. 30 minutes

8 comments
  1. Evzen said:

    Ummm, the original file actually seems to be an XLIFF!
    So why this funny process?
    Or is it just XLIFF-like file, but not a real XLIFF?

    • matroz said:

      Thank you Paul, I agree 1,000% with you. As it looks like every day is a good day for a new CMS or xml filetype, being extremely aware on what needs to be exposed to translation and how to do it is just vital.
      Best,
      Matteo

  2. Evzen said:

    nevermind, I should better read the COMPLETE article first… :-\

    • Indeed 😉 You are right of course but this is really what the article is all about. I like to show interesting stuff in Studio (at least I think it is!) and this was a real problem caused by preparing a file that could have been handled as XLIFF in the first place, but it wasn’t. So working from the end result backwards taking advantage of some nice features and a little interesting regex exercise was fun, and I hope interesting.
      If I had started from XLIFF I could not have used Perfect Match to recover the translation anyway, so I was forced to make the same mistake, but do it properly… so to speak! I think all experiences are quite good to share, even if the scenario as I painted it doesn’t happen very often… or does it?

  3. Robert Bevington said:

    Hi Paul,

    this once more shows how powerful RegEx expressions really are, but I fear it could scare off a few “normal” translators and project managers. Passolo has a really neat feature that basically does the same thing. In the XML parser you can tell the parser to add the translated text to a different element, very similar to your initial problem. This is a very handy feature and is probably a bit more intuitive than RegEx. So maybe Studio could adopt this feature sometime.

    Best regards,

    Robert

    • Hi Robert, I agree with you that we should have a customisable XML filetype in Studio to handle these types of things more easily, similar to Passolo. In reality, with this particular filetype XLIFF would have been the easiest approach from the start, but I’d like to see a multilingual XML capability in Studio anyway and then you’d have the choice and flexibility for similar problems that were not really embedded XLIFF files. Might be something to look at via the OpenX to begin with. I think there is even some sample code for this in the API documentation.

  4. Carla said:

    Hi Paul,

    Very interesting article! Thank you for sharing! I thought it would help me find a solution to my XML problem but I’m not sure…

    Do you know how to convert several monolingual XML files into one multilingual XML file by using XSLT? I received an English XML file to be translated into several target languages but my client needed the file back with all the translations included in it. The problem when using a CAT tool (WorldServer and Studio 2014), is that we get several separate target XML files and not a multilingual XML one.

    The structure of the source English XML file looks like this:

    What do you think about this app?

    How satisfied are you with Linksys Smart Wi-Fi?

    Thank you for your help!

    Best regards,

    Carla
