Now that we’ve learned enough about regular expressions, and because I get so many requests for custom filetypes, I thought it might be useful to take a dip into the world of XPath. So what exactly is XPath?
Well, as far as most CAT tools go it is probably something completely different… certainly it was not used in the old Trados days. But as a tool it’s nothing new: it is simply a language used to find parts of an XML document, and what’s more it’s a language recommended by the World Wide Web Consortium (W3C). So there is nothing proprietary here.
If you did dive in and start to look at the W3C documentation it may, unless you lean towards the technical side, be a little off-putting. But in reality, for most applications in Studio, the phrases “keep it simple” and “economy of accuracy” apply. To illustrate this let’s look at some examples in Studio. Let’s take a simple XML file that contains some translatable text:
<?xml version="1.0" encoding="UTF-8"?>
<xpath>
  <Title>An explanation of XPath in SDL Trados Studio</Title>
</xpath>
In this file there is one translatable element called <Title>. If I create a new filetype to extract this text I would import the XML file and the parser rules would look like this:
The <xpath> element is the root element and I don’t need this as a parser rule so I’ll remove it in a minute, but for the time being take a look at the “Rule” column. Here you see two rules: //Title and //xpath. If I edit the //Title rule I see this:
So as expected the first rule is just me importing the file and Studio taking the <Title> element and making its content translatable, so that when I open the file in Studio all I will see is the text inside the <Title> element. But what you may not have known is that the //Title in the “Rule” column is actually an XPath expression. Pretty simple, huh? It even looks like the syntax you would use for navigating folders in Windows Explorer. So for example, I can add some more translatable text to my example file like this:
<?xml version="1.0" encoding="UTF-8"?>
<xpath>
  <Book lang="en-US">
    <Title>An explanation of XPath in SDL Trados Studio</Title>
    <Text>XPath helps to <b>navigate</b> through the file.</Text>
    <Text>It helps you pick out important <bn>brand names</bn></Text>
    <Notes>
      <Text>This should not be extracted.</Text>
    </Notes>
  </Book>
</xpath>
So, if I wanted to create a rule using XPath that picked out the <Text> it could look like this:
But if I did this I would get all the <Text> inside the <Notes> element as well and I don’t want that. So instead I can be more specific like this:
This way it will not pick up the <Text> element at //Book/Notes/Text, even though it has the same name.
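If you want to see the difference between a broad rule and a more specific one outside Studio, here is a minimal sketch in Python. It assumes the broad rule is //Text and the narrower one is //Book/Text (the rule dialogs are not reproduced here), and it uses the standard library's ElementTree, which only supports a subset of XPath and wants a leading dot, so //Text is written as .//Text. The helper name texts is mine, purely for illustration, and none of this is how Studio itself evaluates rules.

```python
import xml.etree.ElementTree as ET

# The example file from above (XML declaration left out for brevity).
SAMPLE = """
<xpath>
  <Book lang="en-US">
    <Title>An explanation of XPath in SDL Trados Studio</Title>
    <Text>XPath helps to <b>navigate</b> through the file.</Text>
    <Text>It helps you pick out important <bn>brand names</bn></Text>
    <Notes>
      <Text>This should not be extracted.</Text>
    </Notes>
  </Book>
</xpath>
"""

root = ET.fromstring(SAMPLE)


def texts(path):
    """Hypothetical helper: full text (including inline children) of each match."""
    return ["".join(el.itertext()) for el in root.findall(path)]


# Assumed broad rule: every <Text>, wherever it sits in the file.
broad = texts(".//Text")

# Assumed specific rule: only <Text> elements directly under <Book>.
narrow = texts(".//Book/Text")

print(len(broad))   # 3, the <Text> inside <Notes> is included
print(len(narrow))  # 2, the <Text> inside <Notes> is left out
```

The only change between the two is the extra step in the path, which is exactly the "be more specific" idea.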
In this example I would add similar rules for the <b> and <bn> elements and I would make them inline tags so the sentence does not break. Then I’m going to add a rule that says do not extract anything at all unless I specifically tell you to with one of my rules, and I do this by using a wildcard. XPath understands the star symbol to mean select everything… so I add the rule like this and make it “Not translatable”, whereas every other rule I made “Always translatable”:
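A rough way to picture the wildcard is to list what it matches and then take away whatever the specific rules already cover. The sketch below does that in Python with the standard library's ElementTree. The rule paths are my assumption of what the rule set looks like (//* for the wildcard, plus //Book/Title, //Book/Text and the two inline elements), and the idea that a specific rule takes priority over the wildcard is how I am modelling it here, not a statement about Studio's internals.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<xpath>
  <Book lang="en-US">
    <Title>An explanation of XPath in SDL Trados Studio</Title>
    <Text>XPath helps to <b>navigate</b> through the file.</Text>
    <Text>It helps you pick out important <bn>brand names</bn></Text>
    <Notes>
      <Text>This should not be extracted.</Text>
    </Notes>
  </Book>
</xpath>
"""

root = ET.fromstring(SAMPLE)

# The wildcard. In ElementTree ".//*" returns every descendant of the root;
# a full XPath engine evaluating //* would include the root element as well.
everything = root.findall(".//*")

# Assumed specific rules (translatable elements and inline tags).
specific_paths = [
    ".//Book/Title",
    ".//Book/Text",
    ".//Book/Text/b",
    ".//Book/Text/bn",
]
covered = {el for path in specific_paths for el in root.findall(path)}

# Whatever is left is only matched by the wildcard rule.
left_to_wildcard = [el.tag for el in everything if el not in covered]

print(left_to_wildcard)  # ['Book', 'Notes', 'Text']
```

The <Text> that remains in the list is the one inside <Notes>, which is the content we did not want to extract.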
This would give me a set of rules like this (I also took a little liberty to apply some simple formatting to some of the rules):
The file, when I open it in Studio looks like this:
So, all very simple and straightforward. But what happens if the file contains translatable text in an attribute instead of an element? I don’t think this is really good practice, but we all know that in real life this happens all the time. So what if the file looked like this, where the title of the document has been moved into an attribute called mytitle:
<?xml version="1.0" encoding="UTF-8"?>
<xpath>
  <Book lang="en-US">
    <Title mytitle="An explanation of XPath in SDL Trados Studio" />
    <Text>XPath helps to <b>navigate</b> through the file.</Text>
    <Text>It helps you pick out important <bn>brand names</bn></Text>
    <Notes>
      <Text>This should not be extracted.</Text>
    </Notes>
  </Book>
</xpath>
If you were to import this file into Studio and then manually add the rule for translating an attribute like this:
Then you would again see in the “Rule” column that the XPath expression is defined for you like this:
So again this is pretty simple… but of course an attribute is usually a tag so now you will see the document structure column on the right annotates the translatable content as a tag:
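For the curious, here is a small Python sketch of picking up text from an attribute. In full XPath an attribute is addressed with the @ sign, so a rule of this kind would typically look something like //Title/@mytitle (that exact form is my assumption, since the dialog is not reproduced here). The standard library's ElementTree cannot return attributes directly, so the sketch selects the elements that carry the attribute and then reads its value.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<xpath>
  <Book lang="en-US">
    <Title mytitle="An explanation of XPath in SDL Trados Studio" />
    <Text>XPath helps to <b>navigate</b> through the file.</Text>
    <Text>It helps you pick out important <bn>brand names</bn></Text>
    <Notes>
      <Text>This should not be extracted.</Text>
    </Notes>
  </Book>
</xpath>
"""

root = ET.fromstring(SAMPLE)

# ElementTree workaround: find <Title> elements that have a mytitle
# attribute, then read the attribute value from each one.
titles = [el.get("mytitle") for el in root.findall(".//Title[@mytitle]")]

print(titles)  # ['An explanation of XPath in SDL Trados Studio']
```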
At this stage, because Studio has made this so simple, you would be forgiven for wondering why you need to know anything about this syntax at all. Hopefully you’ll always receive simple files and never need to. However… sometimes things are not so simple, and this is where XPath comes into its own. You can enter an XPath expression as the new rule here by selecting XPath instead of the element or attribute:
Let’s take a slightly more complex scenario to see how this works. Suppose our file now contains things that look like this, where an attribute value is used to tell you whether the name should be protected from translation or not:
<Text>Non-translatable <bn lock="y">brand names</bn> are locked</Text>
<Text>Translatable <bn lock="n">brand names</bn> are unlocked</Text>
You still want to see the name, but you want to ensure that the translator will know it has to remain exactly the same. So here you use a new “Not translatable” rule to identify this change, so that when the attribute lock has a value of “y” the content is protected. The syntax for this uses a reference to the attribute value inside square brackets as follows:
In Studio when I open the file with this new content I now see this, where the protected brand names have little padlocks around them, and when the tag is copied to the target you will find the text inside is greyed out and cannot be changed at all:
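The square bracket part of an expression is called a predicate, and it filters the elements matched so far. Going by the multilingual variant of the rule further down, the expression for this rule is presumably //Book/Text/bn[@lock="y"]. Here is a minimal Python sketch of how such a predicate behaves, using the standard library's ElementTree. The <xpath> and <Book> wrapper around the two lines is my assumption so that the path has something to match.

```python
import xml.etree.ElementTree as ET

# The two lines from above, wrapped in an assumed <xpath>/<Book> structure.
SAMPLE = """
<xpath>
  <Book lang="en-US">
    <Text>Non-translatable <bn lock="y">brand names</bn> are locked</Text>
    <Text>Translatable <bn lock="n">brand names</bn> are unlocked</Text>
  </Book>
</xpath>
"""

root = ET.fromstring(SAMPLE)

# Only <bn> elements whose lock attribute equals "y".
locked = root.findall(".//Book/Text/bn[@lock='y']")

# All <bn> elements, for comparison.
all_bn = root.findall(".//Book/Text/bn")

print(len(all_bn))     # 2
print(len(locked))     # 1
print(locked[0].tail)  # ' are locked' (the text that follows the matched tag)
```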
You can even string together attributes. So if the XML file was a multilingual XML file for example, and each part of the file was repeated to allow space for each language like this:
.....
<Text>These <bn lang="en-US" lock="y">brand names</bn> are locked</Text>
<Text>These <bn lang="en-US" lock="n">brand names</bn> are not</Text>
.....
<Text>These <bn lang="de-DE" lock="y">brand names</bn> are locked</Text>
<Text>These <bn lang="de-DE" lock="n">brand names</bn> are not</Text>
.....
<Text>These <bn lang="fr-FR" lock="y">brand names</bn> are locked</Text>
<Text>These <bn lang="fr-FR" lock="n">brand names</bn> are not</Text>
.....
Then in order to prepare a multilingual project with filetypes that extract only the text for the appropriate language codes, you could adapt the same rule we just added for the locked content like this… based on extracting the French translatable text only, by stringing together the attribute tests with a plain-English operator, in this case “and”:
//Book/Text/bn[@lang="fr-FR" and @lock="y"]
So now Studio would only extract the text you need from the strings that have the lang="fr-FR" attribute, as well as paying attention to the need to lock content if the lock attribute is "y".
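To check that combining attribute tests really narrows things down, here is a small Python sketch. One caveat I should be loud about: the standard library's ElementTree does not understand the "and" keyword, so the sketch chains two predicates one after the other, [@lang='fr-FR'][@lock='y'], which selects the same elements for simple attribute tests like these. The <xpath> and <Book> wrapper is again my assumption, standing in for the parts of the file that are elided above.

```python
import xml.etree.ElementTree as ET

# Assumed wrapper around the multilingual lines shown above.
SAMPLE = """
<xpath>
  <Book>
    <Text>These <bn lang="en-US" lock="y">brand names</bn> are locked</Text>
    <Text>These <bn lang="en-US" lock="n">brand names</bn> are not</Text>
    <Text>These <bn lang="de-DE" lock="y">brand names</bn> are locked</Text>
    <Text>These <bn lang="de-DE" lock="n">brand names</bn> are not</Text>
    <Text>These <bn lang="fr-FR" lock="y">brand names</bn> are locked</Text>
    <Text>These <bn lang="fr-FR" lock="n">brand names</bn> are not</Text>
  </Book>
</xpath>
"""

root = ET.fromstring(SAMPLE)

# Chained predicates: both conditions must hold, like "and" in full XPath.
fr_locked = root.findall(".//Book/Text/bn[@lang='fr-FR'][@lock='y']")

# A single condition for comparison: every French <bn>, locked or not.
fr_all = root.findall(".//Book/Text/bn[@lang='fr-FR']")

print(len(fr_all))     # 2
print(len(fr_locked))  # 1
```

With a full XPath engine you could use the expression exactly as written above, with the "and" inside a single pair of brackets.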
There are so many things you can do with XPath to manipulate the information in the XML file, things that were quite tricky, if not impossible, with the older versions of the product, that I couldn’t possibly cover them all here. So if you want to learn more about XPath I would recommend you take a look at the W3Schools website, where they have many really useful tutorials about web programming, and one of these is all about XPath. You can find the relevant material here: w3schools.com XPath tutorial.
I hope this article was useful and not too geeky… but just to finish off, here are a few examples of things I have used XPath for in the past that might be handy if you come across similar questions when preparing filetypes in Studio for some tricky situations:
Where you have translatable content in this fashion with any element containing the attribute translate, <BodyText translate="yes">, then this expression can be used to extract all translatable text.
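As an illustration of the idea, one expression that fits this description is //*[@translate="yes"], which combines the wildcard with an attribute test. That expression is my assumption rather than a quote, and the little file in the sketch below (the <Document> and <Heading> names and the sample text) is entirely hypothetical; only <BodyText translate="yes"> comes from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical file: element names and text are made up for illustration.
SAMPLE = """
<Document>
  <Heading translate="yes">Welcome</Heading>
  <BodyText translate="yes">Please translate me.</BodyText>
  <BodyText translate="no">INTERNAL-CODE-42</BodyText>
</Document>
"""

root = ET.fromstring(SAMPLE)

# Any element, whatever its name, that carries translate="yes".
translatable = [el.text for el in root.findall(".//*[@translate='yes']")]

print(translatable)  # ['Welcome', 'Please translate me.']
```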
//A[@M = '8804']/V
You need the text in <V> but only where M="8804" in <A>. For example:
<A M="8804"><V>Beschreibung zum Task</V></A>
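Here is a quick Python sketch of that expression at work, using the standard library's ElementTree (so with a leading dot in front of the path). The <Tasks> wrapper and the second <A> entry are hypothetical, added only so there is something for the predicate to filter out.

```python
import xml.etree.ElementTree as ET

# The first <A> is the example above; the wrapper and second <A> are made up.
SAMPLE = """
<Tasks>
  <A M="8804"><V>Beschreibung zum Task</V></A>
  <A M="1234"><V>Interne Kennung</V></A>
</Tasks>
"""

root = ET.fromstring(SAMPLE)

# <V> children of those <A> elements whose M attribute is "8804".
values = [v.text for v in root.findall(".//A[@M='8804']/V")]

print(values)  # ['Beschreibung zum Task']
```

The predicate sits on <A>, and the step after it moves down to <V>, so the condition and the extracted element do not have to be the same node.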
Translating the content of an attribute with an element defined by a different attribute. So the translatable content is in the text attribute but only where the attribute id='journal1'.
//book[@lang="fr-FR" and @translate="y"]/ul/li
A way to check for two matching attributes and then the subsequent elements in the path.