More Regex? No, it’s time for something completely different.

Now that we’ve learned enough about regular expressions, and because I get so many requests for custom filetypes, I thought it might be useful to take a dip into the world of XPath.  So what exactly is XPath?

Well, as far as most CAT tools go it probably is something completely different… certainly it was not used in the old Trados days.  But as a tool it’s nothing new: it is simply a language used to find parts of an XML document, and what’s more it’s a language that is recommended by the World Wide Web Consortium (W3C).  So there is nothing proprietary here.

If you did dive in and start to look at the W3C documentation it may, unless you lean towards the technical side, be a little off-putting.  But in reality, for most applications in Studio, the phrases “keep it simple” and “economy of accuracy” apply.  To illustrate this let’s look at some examples in Studio, starting with a simple XML file that contains some translatable text:

<?xml version="1.0" encoding="UTF-8"?>
<xpath>
  <Title>An explanation of XPath in SDL Trados Studio</Title>
</xpath>

In this file there is one translatable element called <Title>.  If I create a new filetype to extract this text I would import the XML file and the parser rules would look like this:


The <xpath> element is the root element and I don’t need this as a parser rule so I’ll remove it in a minute, but for the time being take a look at the “Rule” column.  Here you see two rules; //Title and //xpath.  If I edit the //Title rule I see this:


So as expected the first rule is just the result of me importing the file: Studio takes the element <Title> and makes its content translatable, so that when I open the file in Studio all I will see is the text inside the <Title> elements.  But what you may not have known is that the //Title in the rule column is actually an XPath expression.  Pretty simple, huh?  It even looks like the syntax you would use for navigating folders in Windows Explorer.  So for example, I can add some more translatable text to my example file like this:

<?xml version="1.0" encoding="UTF-8"?>
<Book lang="en-US">
  <Title>An explanation of XPath in SDL Trados Studio</Title>
  <Text>XPath helps to <b>navigate</b> through the file.</Text>
  <Text>It helps you pick out important <bn>brand names</bn></Text>
  <Notes>
    <Text>This should not be extracted.</Text>
  </Notes>
</Book>

So, if I wanted to create a rule using XPath that picked out the <Text> it could look like this:


But if I did this I would get all the <Text> inside the <Notes> element as well and I don’t want that.  So instead I can be more specific like this:


This way it will not pick up the <Text> element in here, //Book/Notes/Text, even though they have the same name.
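If you want to try the difference between a broad and a specific expression outside Studio, here is a minimal sketch in Python using only the standard library.  It assumes the two rules were //Text and //Book/Text (the rule screenshots are not reproduced here) and that the last <Text> sits inside a <Notes> element, as the text describes.  Python’s built-in ElementTree only understands a subset of XPath and wants paths to start with a dot, so the `holder` wrapper is purely a workaround for this illustration and says nothing about how Studio evaluates its rules.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<Book lang="en-US">
  <Title>An explanation of XPath in SDL Trados Studio</Title>
  <Text>XPath helps to <b>navigate</b> through the file.</Text>
  <Text>It helps you pick out important <bn>brand names</bn></Text>
  <Notes>
    <Text>This should not be extracted.</Text>
  </Notes>
</Book>
"""

# Hypothetical wrapper: ElementTree only searches below the element you
# start from, so the root <Book> needs a parent before //Book can match it.
holder = ET.Element("holder")
holder.append(ET.fromstring(SAMPLE))


def text_of(elements):
    """Flatten each element to its full text, inline tags included."""
    return ["".join(e.itertext()) for e in elements]


broad = text_of(holder.findall(".//Text"))          # stands in for //Text
specific = text_of(holder.findall(".//Book/Text"))  # stands in for //Book/Text

print(len(broad), "matches for //Text")
print(len(specific), "matches for //Book/Text")
```

The broad expression also returns the sentence inside <Notes>, while the specific one only returns the two <Text> elements that are direct children of <Book>.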

In this example I would add similar rules for the <b> and <bn> elements and I would make them inline tags so the sentence does not break.  Then I’m going to add a rule that says do not extract anything at all unless I specifically tell you with one of my rules and I do this by using a wildcard.  XPath understands the star symbol to mean select everything… so I add the rule like this and I make it a non-translatable rule where everything else I made “Always translatable”:
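To get a feel for what the wildcard catches, here is a rough Python sketch.  It assumes the catch-all rule is written as //* and that the specific rules are //Book/Title, //Book/Text, //Book/Text/b and //Book/Text/bn; only //Book/Text is confirmed by the text, the rest are guesses for illustration.  The “specific rule wins over the wildcard” logic is a simplified model, not a description of how Studio resolves its rules internally.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<Book lang="en-US">
  <Title>An explanation of XPath in SDL Trados Studio</Title>
  <Text>XPath helps to <b>navigate</b> through the file.</Text>
  <Text>It helps you pick out important <bn>brand names</bn></Text>
  <Notes>
    <Text>This should not be extracted.</Text>
  </Notes>
</Book>
"""

# Hypothetical wrapper so that the root element itself can be matched.
holder = ET.Element("holder")
holder.append(ET.fromstring(SAMPLE))

# Assumed specific rules (ElementTree needs the leading dot).
SPECIFIC_RULES = [
    ".//Book/Title",
    ".//Book/Text",
    ".//Book/Text/b",
    ".//Book/Text/bn",
]

everything = holder.findall(".//*")  # stands in for the wildcard rule //*
claimed = {id(e) for rule in SPECIFIC_RULES for e in holder.findall(rule)}

# Elements reached only by the wildcard, i.e. the ones left non-translatable.
fall_through = [e.tag for e in everything if id(e) not in claimed]
print(fall_through)
```

In this model the only elements left to the wildcard are <Book>, <Notes> and the <Text> inside <Notes>, which is exactly the content that should stay out of the translation.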


This would give me a set of rules like this (I also took a little liberty to apply some simple formatting to some of the rules):


The file, when I open it in Studio looks like this:


So, all very simple and straightforward.  But what happens if the file contains translatable text in an attribute instead of an element?  I don’t think this is really good practice, but we all know that in real life this happens all the time.  So what if the file looked like this, where the title of the document has been moved into an attribute called mytitle?

<?xml version="1.0" encoding="UTF-8"?>
<Book lang="en-US">
  <Title mytitle="An explanation of XPath in SDL Trados Studio" />
  <Text>XPath helps to <b>navigate</b> through the file.</Text>
  <Text>It helps you pick out important <bn>brand names</bn></Text>
  <Notes>
    <Text>This should not be extracted.</Text>
  </Notes>
</Book>

If you were to import this file into Studio and then manually add the rule for translating an attribute like this:


Then you would again see in the “Rule” column that the XPath expression is defined for you like this:


So again this is pretty simple… but of course an attribute is usually a tag so now you will see the document structure column on the right annotates the translatable content as a tag:
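For anyone curious what selecting an attribute looks like in practice, here is a small Python sketch.  In full XPath an attribute is addressed with an @ sign, so the generated rule is probably something along the lines of //Title/@mytitle; that exact expression is an assumption, since the screenshot is not reproduced here.  Python’s built-in ElementTree cannot return attribute nodes directly, so the sketch selects the elements that carry the attribute and then reads the value.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<Book lang="en-US">
  <Title mytitle="An explanation of XPath in SDL Trados Studio" />
  <Text>XPath helps to <b>navigate</b> through the file.</Text>
</Book>
"""

root = ET.fromstring(SAMPLE)  # root is the <Book> element

# Select every <Title> child that has a mytitle attribute, then read the
# attribute value from each match (approximates //Title/@mytitle).
titles = [e.get("mytitle") for e in root.findall("./Title[@mytitle]")]
print(titles)
```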


At this stage, because Studio has made this so simple, you would be forgiven for wondering why you need to know anything about all of this syntax at all.  Hopefully you’ll always receive simple files and never need to.  However… sometimes things are not so simple, and this is where XPath comes into its own.  You can enter an XPath expression as the new rule here by selecting XPath instead of the element or attribute:


Let’s take a slightly more complex scenario to see how this works.  Suppose our file now contains things that look like this, where an attribute value is used to tell you whether the name should be protected from translation or not:

<Text>Non-translatable <bn lock="y">brand names</bn> are locked</Text>
<Text>Translatable <bn lock="n">brand names</bn> are unlocked</Text>

You still want to see the name, but you want to ensure that the translator knows it has to remain exactly the same.  So here you use a new “Not translatable” rule so that when the attribute lock has a value of “y” the content is protected.  The syntax for this uses a reference to the attribute value inside square brackets as follows:
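The rule itself only appears in a screenshot, but judging by the combined expression used further down it was presumably //Book/Text/bn[@lock="y"]; treat that as an assumption.  Here is a minimal Python sketch of what such an attribute test selects.  The two sentences are wrapped in a <Book> element so that they form a well-formed file.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<Book lang="en-US">
  <Text>Non-translatable <bn lock="y">brand names</bn> are locked</Text>
  <Text>Translatable <bn lock="n">brand names</bn> are unlocked</Text>
</Book>
"""

root = ET.fromstring(SAMPLE)  # root is the <Book> element

# Starting from <Book>, ./Text/bn[...] approximates //Book/Text/bn[...].
# The part in square brackets is the test on the attribute value.
locked = root.findall("./Text/bn[@lock='y']")
unlocked = root.findall("./Text/bn[@lock='n']")

print(len(locked), "locked,", len(unlocked), "unlocked")
```

Only the first <bn> passes the test in square brackets, so only that brand name would be picked up by the “Not translatable” rule.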


In Studio, when I open the file with this new content, I now see this: the protected brand names have little padlocks around them, and when the tag is copied to the target you will find the text inside is greyed out and cannot be changed at all:


You can even string together attributes.  So if the XML file was a multilingual XML file for example, and each part of the file was repeated to allow space for each language like this:

<Text>These <bn lang="en-US" lock="y">brand names</bn> are locked</Text>
<Text>These <bn lang="en-US" lock="n">brand names</bn> are not</Text>
<Text>These <bn lang="de-DE" lock="y">brand names</bn> are locked</Text>
<Text>These <bn lang="de-DE" lock="n">brand names</bn> are not</Text>
<Text>These <bn lang="fr-FR" lock="y">brand names</bn> are locked</Text>
<Text>These <bn lang="fr-FR" lock="n">brand names</bn> are not</Text>

Then in order to prepare a multilingual project with filetypes that extract only the text for the appropriate language codes, you could adapt the rule we just added for the locked content.  For example, to pick out the French text only, you string the attribute tests together with the plain-English operator “and”:

//Book/Text/bn[@lang="fr-FR" and @lock="y"]

So now Studio would only extract the text you need from the strings that have the lang=“fr-FR” attribute as well as paying attention to the need to lock content if the lock attribute is “y”.
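Here is a short Python sketch of the combined test, again using only the standard library.  One caveat: ElementTree’s XPath subset has no “and” operator, so the sketch chains two sets of square brackets instead.  For simple attribute tests like these, chained predicates select the same nodes as a single predicate joined with “and”.  The <Book> wrapper is added to make the fragment well-formed.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<Book>
  <Text>These <bn lang="en-US" lock="y">brand names</bn> are locked</Text>
  <Text>These <bn lang="en-US" lock="n">brand names</bn> are not</Text>
  <Text>These <bn lang="de-DE" lock="y">brand names</bn> are locked</Text>
  <Text>These <bn lang="de-DE" lock="n">brand names</bn> are not</Text>
  <Text>These <bn lang="fr-FR" lock="y">brand names</bn> are locked</Text>
  <Text>These <bn lang="fr-FR" lock="n">brand names</bn> are not</Text>
</Book>
"""

root = ET.fromstring(SAMPLE)  # root is the <Book> element

# Approximates //Book/Text/bn[@lang="fr-FR" and @lock="y"]:
# both attribute tests must hold for an element to be selected.
french_locked = root.findall("./Text/bn[@lang='fr-FR'][@lock='y']")

for bn in french_locked:
    print(bn.attrib, bn.text)
```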

There are so many things you can do with XPath to manipulate the information in the XML file, things that were quite tricky if not impossible with the older versions of the product, that I couldn’t possibly cover them all here.  So if you want to learn more about XPath I would recommend you take a look at the W3Schools website, where they have many really useful tutorials about web programming, and one of these is all about XPath.  You can find the relevant material here: XPath tutorial.

I hope this article was useful and not too geeky… but just to finish off, here are a few examples of things I have used XPath for in the past that might be handy if you come across similar questions when preparing filetypes in Studio for some tricky situations:

Where you have translatable content in this fashion with any element containing the attribute translate, <BodyText translate="yes">, then this expression can be used to extract all translatable text.
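The expression for this case did not survive in the text.  A wildcard combined with an attribute test, such as //*[@translate="yes"], would fit the description, but that is an assumption rather than the original rule.  A quick Python sketch of that idea follows, where the <doc> root, the <Caption> element and the sample sentences are all made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample file; only <BodyText translate="yes"> comes from the text.
SAMPLE = """
<doc>
  <BodyText translate="yes">Translate me</BodyText>
  <BodyText translate="no">Leave me alone</BodyText>
  <Caption translate="yes">Me too</Caption>
</doc>
"""

root = ET.fromstring(SAMPLE)

# Any element, whatever its name, whose translate attribute equals "yes".
translatable = root.findall(".//*[@translate='yes']")
print([(e.tag, e.text) for e in translatable])
```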

//A[@M = '8804']/V
You need the text in <V> but only where M="8804" in <A>. For example:
<A M="8804"><V>Beschreibung zum Task</V></A>
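As a quick check of this one, here is a small Python sketch.  The <data> root and the second <A> element with M="1234" are hypothetical additions, included only so there is something for the expression to reject.

```python
import xml.etree.ElementTree as ET

# The first <A> comes from the example above; the rest is made up.
SAMPLE = """
<data>
  <A M="8804"><V>Beschreibung zum Task</V></A>
  <A M="1234"><V>Not wanted</V></A>
</data>
"""

root = ET.fromstring(SAMPLE)

# The attribute test sits on <A>, but the step after it selects the child <V>.
wanted = [v.text for v in root.findall(".//A[@M='8804']/V")]
print(wanted)
```

The point to notice is that the test in square brackets filters the parent, while the element actually extracted is the child named after the closing bracket.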

Translating the content of an attribute with an element defined by a different attribute.  So the translatable content is in the text attribute but only where the attribute id='journal1'

//book[@lang="fr-FR" and @translate="y"]/ul/li
A way to check for two matching attributes and then the subsequent elements in the path.

  1. Very nice, Paul. This goes in a direction I have hoped others would take for dealing with other aspects of translation data, but I never thought of this particular approach. Score one for SDL :-)

    • Thanks Kevin. Interestingly this is an underused capability that has been in Studio since 2009.

  2. Actually, we did use this a lot. Great feature. Nice that one can automatically use the content inside the XML tags as HTML. Is it fixed so it also works for attributes?

    • Not easily… you still need to handle the attributes for embedded content with regex. An improved solution is on the cards but I’m not sure when it will be ready yet.

  3. Christine said:

    Great article, once again! I was just wondering about whether to use single or double quotes for attribute values. The examples are not consistent in this aspect (@lang="fr-FR" vs. @id='journal1'). I guess both works, but single quotes is the official recommendation?

    • Hi Christine, well spotted! Actually the reason I used one or the other was because this is how the attribute values were contained in the XML files I was using. I did have a problem with this on one particular file a long time ago until I realised I was using different types of quotes so ever since then I always copy and paste whatever is in the file to be sure they match. It may be that sometimes different quotes don’t matter… but I prefer to err on the side of caution.

  4. Stéphanie said:

    Wooho! I was not alone! I’m working for a translation agency and one of my missions is to create specific file types for our customers. When I began to work with XPath, I was really excited but…not one of my colleagues has understood “the beauty of the thing”, probably too geeky… ;) Thank you for your good posts! It’s every time a pleasure!

    • Hi Stéphanie, you certainly are not and I’m glad you enjoy the posts. Sometimes I feel a little too geeky as I write them… but then I remember I’m not clever enough to be a real geek! I think these little details are the things that really are the little known strengths of Studio.

  5. Hi Paul, always a pleasure reading your blog (I’ve been bracing myself to download RegexBuddy, but haven’t gotten to it yet :P)

    I came across this post when actually searching a solution for a tricky situation we have using Studio. We have an ongoing software localization project (I’m the PM) from a relatively rare language, and we’re asked to use ttx files created from their original xml. The vendor we have uses Studio, but he’s still a relative newbie to the world of CAT tools and doesn’t work with, and is quite baffled by, Tag Editor.

    The issue is in the ttx files themselves. Along with the source and target segments, and the other usual ttx stuff, there’s an additional row. Something like:

    (context) [really important instruction related to the segment below]
    (source segment with language software stuff)(target segment with localized software stuff)

    From what I’ve been reading, here seems to be an element. So my point is: Is there a way that doesn’t require a CS degree to display this info in a non-translatable way in Studio (tooltips, comments, anything)? There are a lot of workarounds, I know, but they all slow our workflow considerably and we usually have tight deadlines for this. Any ideas? Or are we better off asking a computer engineering graduate to write a custom filter for us? (or am I an idiot and it’s the simplest thing there is? :D) Thank you for your help :)

    • Hello João, I’m not sure whether you mean open the TTX in Studio or the native XML? Either way it’s possible. With TTX you create the ini file to set the element containing the comment as non-translatable and then the preview in 2014 for TTX will allow you still see the comment as you would in TagEditor. But if you want a better experience then just create a stylesheet to go with the XML and then use this to preview the comments. It’s not necessarily that hard to do and I did write an article on how to create a simple stylesheet for a custom XML filetype here : Translate with Style. If this doesn’t help you feel free to drop me an email and I can see whether I can help you easily or if we do need to find a rocket scientist ;-)

  6. Dave Simons said:

    Hello Paul,

    I am trying to process a dita file with an element which is not declared- Studio says that the child element is invalid, and then lists a series of elements it is expecting. Where can I add to that list?

    • Hi Dave, sounds more like a problem being reported on the file itself as opposed to the rules.

      • Dave Simons said:

        Yes- it could be. I’ll go back to the client.
