My son asked me how my day had gone and before I could answer he said in a slightly mocking tone “blah blah blah… XML… blah… XML … blah blah”. Clearly I spend too much time outside of work talking about work, and clearly his perception of what I do is tainted towards the more technical aspects I like the most! Aside from the note to self “stop talking about this stuff after I leave the office!” it got me thinking about why I probably think about XML as much as I apparently do and how I could help others avoid the very same compulsion! I’ve written articles in the past about how to use regular expressions in Studio, and an article on using XPath, and I’ve probably touched on handling XML files from time to time in various articles. But I don’t think I’ve ever explained how to create an XML filetype in the first place, or why you would want to… after all Studio has default filetypes for XML and this is just another filetype that the CAT tool should be able to handle… right?
Wrong! Well partly wrong anyway. If the XML is simple then the default filetypes will probably handle the file perfectly well. But what makes XML unique compared to most other filetypes is that the translatable text could be hidden in user defined locations, and Studio (or any CAT tool for that matter) does not necessarily know where it is, or that it should be translated, without you providing some additional information.
But before I dive in I think it might be helpful to understand a little of the terminology here. Not everything you’ll ever need to know about XML, as this is a pretty big subject and I’m still learning myself, but rather a few simple things that are relevant to knowing how to extract translatable text. Take this short XML file as an example:
The first red line <grandparent> is called an element. In this case it’s an element I decided to call “grandparent”. In fact this element is a special one because it’s also the first and last element in the file. So this is also called the “root element” and it’s important to note this because we can use the root element as one way to automatically decide which filetype to use when opening an XML file in Studio. There are two more elements in my file; “parent” and “child”. I deliberately used these names because this nesting of the elements inside one another is important. Here we have one element called “parent” which is a child element of the “grandparent” and it contains the translatable text “I’m the parent and I have two children”. Inside this “parent” element I have two “child” elements and they also contain an attribute. The translatable text is in black, “My son is called George.”, and “My daughter is called Sally.” The attributes are providing more detailed information about the children, so in this case defining whether they are male or female. As a general rule, the information I have just provided is how we would like things to be in an ideal world. But that would be too easy and in reality there are no real rules to say when you should use an element and when you should use an attribute! In practice this is just how we would like to see them in the translation industry and if they come like this then the default XML filetype, called AnyXML in Studio, will often suffice. But we want to look at the real world!
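To make the terminology concrete, here is a minimal Python sketch that reads a file shaped like the one described above and prints the root element, the element text and the attributes. The XML is my reconstruction from the description only: the attribute name gender and its values are assumptions, and the standard library xml.etree.ElementTree module is used purely for illustration.

```python
import xml.etree.ElementTree as ET

# Reconstructed sample: the attribute name "gender" and its values are
# assumptions, the element names and texts follow the description above.
SAMPLE = """<grandparent>
  <parent>I'm the parent and I have two children
    <child gender="male">My son is called George.</child>
    <child gender="female">My daughter is called Sally.</child>
  </parent>
</grandparent>"""

root = ET.fromstring(SAMPLE)

# The root element is the outermost element of the file.
root_name = root.tag

# "parent" is a child element of the root and holds text of its own.
parent = root.find("parent")
parent_text = parent.text.strip()

# Each "child" element carries one attribute plus its translatable text.
children = [(c.get("gender"), c.text) for c in parent.findall("child")]

print("root element:", root_name)
print("parent text:", parent_text)
for gender, text in children:
    print("child:", gender, "|", text)
```

The point of the sketch is only that element text and attribute values live in different places, so a tool has to be told about each of them separately.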
Why use custom XML filetypes?
So let’s take a look at a couple of files, starting with an XML file that looks like this and following the simplistic logic I explained above:
This is quite straightforward, every piece of text that looks as though it’s translatable is inside an element. If I open this in Studio without creating a custom XML filetype it looks like this – I’m using the TagID mode to display tags so the tags are numbered and the orange tab at the top is displaying the name of the filetype that is being used to open the file for translation:
So here, the Any XML filetype (the default in Studio) does a pretty good job and even manages to determine what is likely to be inline tags versus external tags, so the text flows quite nicely making it easy for the translator. For this file you could even copy source to target for things like segment #2 which is not really translatable text. So pretty simple and definitely usable… But now let’s take an XML file like this which contains exactly the same content but has been prepared in another, equally valid, XML way:
This one is trickier because some of the translatable text is now in an attribute rather than an element (the productname attribute for example), and the default AnyXML filetype does not extract text from an attribute. We might not consider it to be good practice to use an attribute for this, but in practice we all know it’s pretty commonplace and there can be good reasons for the file to be constructed this way. In addition to this, much of the text is provided in the file by using a CDATA section. This is normally used for a part of the file that the XML parser should treat as plain character data rather than markup, often because it includes characters that would otherwise be illegal in XML. This can be a complete HTML file that is embedded into the XML, or part of one as I have done here, or even something else based on a custom script written by a developer. So there is no single way to handle all embedded content. If I open this file with the default AnyXML filetype then it looks like this, where we see HTML entities, opening and closing tags (<> for example) and the names of these tags (div class=”mat” for example), all of which you would not want to have to try and translate around:
Yuck! Not very nice at all because not only is it parsing the HTML code inside the CDATA section as translatable text without any kind of tag protection at all, it’s also missing the product name at the start because in this file it was stored in an attribute as opposed to an element. So what we really need is a custom XML filetype that can deal with the specific nuances of this particular XML file. The release of Studio 2014 SP1 provides a neat way of dealing with the CDATA or any other form of embedded code, but the basic principle of how to create a custom XML filetype in Studio is the same now as it has been since the release of Studio 2009. At the risk of this being a ridiculously long post let’s take a look at how this is done using the second, and more complicated XML example… at least we’ll look at how I normally tackle it as there are other ways.
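The two problems are easy to reproduce outside Studio. The sketch below is not how Studio works internally. It simply uses Python's xml.etree.ElementTree as a stand-in for any reader that only collects element text. The element and attribute names come from this article, but the layout of the file and every text value in it are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the second file: names are from the article,
# the nesting and all text values are assumptions.
SAMPLE = """<rootelement>
  <product productname="Garden chair">GC-001
    <productcatalogentry><![CDATA[<div class="mat">Teak &amp; steel</div>]]></productcatalogentry>
  </product>
</rootelement>"""

root = ET.fromstring(SAMPLE)
product = root.find("product")

# Collect element text only, which is roughly what a generic filter sees.
element_texts = [t.strip() for t in root.itertext() if t.strip()]

# The product name never shows up above because it lives in an attribute.
attribute_text = product.get("productname")

print(element_texts)
print(attribute_text)
```

The CDATA content arrives as one raw string, markup and entities included, and the attribute value is simply absent from the element text. Those are the same two symptoms as in the screenshot described above.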
A general note though. I won’t be covering every single thing about XML file types in Studio either. So if you have a question I didn’t address in this post please refer to the online help which is pretty useful, or post your question into the comments below so we can build a useful reference article for anyone else.
My first step in creating a custom XML filetype in Studio is to import the XML file so I have all the elements and attributes in the XML available to me for selection as I create the custom rules. To do this you go to the File Types node in your options (not forgetting there are differences between the general Options and Project Settings as usual) and click on New which will bring up the Select Type dialog box:
If you’ve done this before in any version of Studio prior to Studio 2014 SP1 you will note that there are now two options for XML.
- XML (Legacy Embedded Content), and
- XML (Embedded Content)
I mention them in this order because the XML (Legacy Embedded Content) is the same as it was in previous versions. XML (Embedded Content) is the new to Studio 2014 SP1 approach and I’ll cover this after discussing the old one briefly as I go through the steps. The next steps will be the same irrespective of which of the two XML types you choose.
File Type Information
First you select the XML filetype you want and click on OK. This brings up the File Type Information dialogue box where the only two things I normally change are the File type name and the File type identifier:
I change these because it makes it easier for me when I’m working to see the name of the filetype I created in the list of available filetypes, and also when I open the file and change the tag display to TagID mode the orange tab at the start and end of each file will display the name of the filetype too. Because I generally create XML filetypes to help other users I find it useful to easily distinguish the names in this way. Then I click on Next >.
XML Settings Import
This takes me to the XML Settings Import dialogue where I typically select my XML file so that all the elements and attributes are added to my filetype to make it easier for me to create the rules:
I browse to my XML file, or one that is representative of a batch of XML files, and after it’s selected as shown above click on Next >.
I now see all the elements that have been identified in the file, and they are listed like this based on whatever defaults Studio believes the tags should be:
At this point I would normally just click on Next > and would address the rules in detail after the filetype was created, but to keep this simple and ensure the article flows logically I’ll make the changes to the parser rules now. But it’s worth noting that if you miss something it doesn’t matter as it’s simple to make changes later on.
If you look at the XML example above, the one with the CDATA as this is the XML we are addressing here, you can see that what we want to extract for translation is the content of the productname attribute and the productcatalogentry element. The rest I don’t want. So first I remove, or disable, the rootelement and the product. Then I add in two rules… you’ll see the Add…, Edit… etc toolbar becomes active for more options when you select a rule. My Parser Rules now look like this:
I created the //* by selecting the XPATH option in the Add Rule dialog box (read more on XPath in Studio). This is basically a wildcard where the star simply means select everything, and I made that Not translatable. I did this because the first thing I want is to make sure I get nothing at all parsed into my file, and then I can bring in the information I specifically ask for, which in this case is the two rules above that. This is not essential, but I was shown this when I first learned from the Master, Patrik Mazanek, and the habit stuck!
The productcatalogentry element was already there; I just changed the Translate property to Always Translate by editing the rule. I did this as a matter of course because the default is Translatable (except in protected content) and I want to be sure that the content of this element will always be extracted even if its parent element is not. Plus of course I wanted to explain this concept that could be the reason for text not being parsed if you set a parent element to Not translatable.
I could have created the //product/@productname rule using XPath too, but because I imported the file into Studio earlier on as part of these steps it’s easier to let Studio do this for me. So I just Add… a new rule and select the Rule type Attribute, then select the element containing the attribute I want to narrow it down (a large XML file could contain a huge list, and sometimes with overlapping attribute values):
I set this to Always Translatable as well and then click Next >.
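The logic of these three rules (exclude everything, then opt in the one element and the one attribute) can be sketched in a few lines of Python. This is only an emulation of the idea and not Studio's parser. The sample file is the same hypothetical one as before, with invented text values, and the function name extract is my own.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: names from the article, layout and values invented.
SAMPLE = """<rootelement>
  <product productname="Garden chair">GC-001
    <productcatalogentry><![CDATA[<div class="mat">Teak &amp; steel</div>]]></productcatalogentry>
  </product>
</rootelement>"""

def extract(xml_text):
    """Emulate the three rules: nothing by default, then two explicit opt-ins."""
    root = ET.fromstring(xml_text)
    segments = []
    for elem in root.iter():  # walks the elements in document order
        # Rule //product/@productname: the attribute value is translatable.
        if elem.tag == "product" and elem.get("productname"):
            segments.append(("attribute", elem.get("productname")))
        # Rule productcatalogentry: translatable even though its parent is not.
        if elem.tag == "productcatalogentry" and elem.text and elem.text.strip():
            segments.append(("element", elem.text.strip()))
        # Rule //*: anything not opted in above is skipped.
    return segments

segments = extract(SAMPLE)
for kind, text in segments:
    print(kind, "|", text)
```

The product code GC-001 is never emitted because nothing opts it in, which is the effect of the wildcard rule.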
I’m now brought into the File Detection dialog that provides me with a number of different ways to recognise the XML file I am opening. This is very important because as you create more and more it’s quite easy for Studio to use the wrong filetype for a particular file and if you didn’t notice (also remember why I always change the identifiers at the start) you may find your translated file coming back partially translated or parsing information that should not be at risk of change at all. This is particularly so when handling XML files from the same customer as they may well use the same root element but have different schemas for example. In this case, let’s keep it simple and just use the root element as the criteria for recognising my file:
My root element was actually called rootelement and as you can see in the image it is already populated because I imported the XML into my filetype at the start. So all pretty straightforward… and at this point I can click on Finish… and that’s it. My custom filetype is complete… almost!
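Detection by root element is a simple idea: read just far enough into the file to see the first element name and look it up. Here is a hedged Python sketch of that. The registry and both filetype names in it are invented, and Studio's own detection offers more criteria than this.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical registry mapping a root element name to a filetype name.
FILETYPES_BY_ROOT = {
    "rootelement": "Product catalogue XML",
    "grandparent": "Family XML",
}

def detect_filetype(xml_bytes, default="Any XML"):
    """Pick a filetype from the root element, stopping at the first start tag."""
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start",)):
        # The first "start" event always belongs to the root element.
        return FILETYPES_BY_ROOT.get(elem.tag, default)
    return default

print(detect_filetype(b"<rootelement><product/></rootelement>"))
print(detect_filetype(b"<somethingelse/>"))
```

It also shows why a root element alone can be too weak: two different file layouts that share a root element would both land on the same filetype.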
My attribute is being parsed this time so segment #1 contains the translatable text from the attribute value (note that this is also annotated as a TAG in the document structure column on the right because it is an attribute and not an element), and I have not got the product code which was extracted with the default AnyXML. So I’m nearly there. All I have to do now is tackle this embedded HTML in the CDATA section.
There are two ways to do this depending on which XML filetype you created, but I’m mostly interested in the new way with Studio 2014 SP1. However, I’ll take a brief look at the Legacy Embedded Content first.
XML (Legacy Embedded Content)
In both the Legacy and the new method for handling Embedded Content you have to first enable it. So for this legacy filetype I would do this and check the box:
When I do this I can now add the Document Structure I want to be handled with the Embedded content processor. At this point you may well be asking what do I mean by this? Well, take the file we have so far. The right hand column, the one that appears when you open the file, contains this information, and you can expand it by clicking on it:
The Code you see here is the code you need to add into the list for any text that contains embedded content that you wish to treat with the embedded content processor. In this example TAG is the code, but I don’t want to use that one as there is no embedded content in this segment. It’s also worth noting that if there were I could not handle embedded content inside an attribute anyway… hopefully most users will never come across anything so poorly written as that! There is in the next segment however as this is the CDATA Section. Now, because these two types of Document Structure are also Studio codes and not custom codes that I created myself I can use the Location (Tag Content in the example above) to identify it in the list. So I actually want the CDATA Section which I can select like this:
Once I’ve done that I need to create my Tag definition rules. Now this will be a similar process to the way you handle embedded content in a Microsoft Excel file which I wrote about in “Handling taggy Excel files in Studio…“. So I won’t write a lot more on this process for Legacy XML filetypes. Suffice it to say the finished rules for my filetype might look something like this:
These take a while to create, are pretty rough and the finished article, whilst better than the version produced by the AnyXML filetype, does still leave a bit to be desired. I could spend time working on this to make it more user friendly, but even after all of this it would only take a file to be provided that contained different markup and I might have to start changing the rules again. Using these rules I get this, which protects things I wanted protected but also doesn’t really make for simple translating because everything is a tag including the entities, and the translator will have no idea about the context of the text because the embedded content rules with this method cannot hold Document Structure Information of their own. I would not say this has no place however, because there are some files where the ability to use regular expressions to protect tags, and text you don’t want to be translated, is a real plus. But there is a better way!
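For anyone who has not worked with this kind of rule, the legacy approach boils down to pattern matching: anything a regular expression matches becomes a protected tag and whatever is left over is text. The Python sketch below shows the principle only. The pattern is my own assumption and not the set of rules from my filetype, and the sample string is invented.

```python
import re

# Assumed pattern: opening or closing tags, or character entities.
TAG_OR_ENTITY = re.compile(r"</?[A-Za-z][^<>]*>|&#?\w+;")

def split_protected(text):
    """Return (kind, value) pairs where kind is 'tag' or 'text'."""
    parts = []
    pos = 0
    for match in TAG_OR_ENTITY.finditer(text):
        if match.start() > pos:
            # Anything between two matches is translatable text.
            parts.append(("text", text[pos:match.start()]))
        parts.append(("tag", match.group()))
        pos = match.end()
    if pos < len(text):
        parts.append(("text", text[pos:]))
    return parts

parts = split_protected('<div class="mat">Teak &amp; steel</div>')
for kind, value in parts:
    print(kind, "|", value)
```

Note how the entity ends up as a tag in the middle of the sentence, and how nothing in the result says what kind of content the text came from. Both are the drawbacks mentioned above.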
Ziad Chama also recorded an excellent webinar that is freely available called “How to create an XML File Type in SDL Trados Studio 2014” which goes through the process in detail. I’d thoroughly recommend you watch this if you have any interest in creating XML filetypes in Studio as it is very informative and Ziad is a real expert. It covers Studio 2014 prior to the release of SP1 which introduced a new method, so that’s what I’ll cover next.
You can also find a handy knowledgebase article here that is straight to the point!
XML (Embedded Content)
But now let’s take a look at the new method in Studio 2014 SP1. The first steps are exactly the same, but when you get to the embedded content section this is where you’ll notice the difference. It looks like this:
So two new things:
- There is a drop down box that seems to refer to completely different filetypes
- You can decide whether to apply the embedded content processing to CDATA Sections or any other named Document Structure Information (as before with the legacy filetype)
Selecting the embedded content processor to use
Let’s tackle point one first. This is a drop down box that refers to completely different filetypes. So you will probably already see that the concept here is to use a filetype within a filetype rather than have to create the regular expressions to handle the content as we did with the legacy embedded content processor:
The defaults are the regular expression filetype, and the two HTML filetypes that come with Studio out of the box. But you can add your own which makes it possible to configure one of these filetypes so it does not use the defaults and then have different embedded content processors depending on the content of the work you are doing. So if I collapse my navigation menu I now see this in my options:
Expanding this allows me to take a copy of one of the three defaults and then configure it as I see fit. I don’t really have to do this for the simple complex example I have used for this article, but this is how you would do it! You click on the Embedded Content Processor node and you’ll see the three available filetypes. Select the one you are interested in; so in my case I picked the HTML 5 filetype, and then click on Copy…:
You get a small dialogue box where you can change the name of the File type and the File type identifier as before. Pay attention to the identifier because you cannot use the same one as you did for the main filetype, as duplicate file type IDs are never allowed. Then click on OK and close the Options. You need to close them because if you don’t the list won’t be refreshed (a little issue that I’m sure will get resolved in a future release!). Then when you open the options again and go to the Embedded Content node of the new XML filetype you created you will be able to select your new filetype as an embedded content processor like this:
Identify where the embedded content is found?
This brings us onto my second point which is that you can decide whether to apply the embedded content processing to CDATA Sections or any other named Document Structure Information (as before with the legacy filetype). If the embedded content is in a CDATA section, which is probably the most common use case, then you do nothing more than check the CDATA sections checkbox as shown in the introduction to this part of the article. I can then open the file and see this without having to do any additional work at all:
Much better… and easier! Because I’m also still in TagID mode you can see the names of the filetypes which are being used in the orange tabs, and the embedded content processor displays the correct Document Structure Information for this filetype which adds additional context for the translator. You’ll also note that the entity values are correctly transposed so I don’t have to deal with them as tags.
If the embedded content was in another type of Document Structure then it works in exactly the same way. You select the appropriate code and that’s it. No need to add a bunch of regex rules in here.
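The difference between the two approaches is really the difference between matching patterns and using a proper parser for the embedded format. As a rough illustration, and with no claim that this is what Studio does under the hood, here is Python's standard html.parser handling a piece of embedded HTML. The class name TextWithContext and the sample markup are invented.

```python
from html.parser import HTMLParser

class TextWithContext(HTMLParser):
    """Collect text together with the name of the tag that encloses it."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.stack = []
        self.segments = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Simplification: pop back to the matching open tag if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            context = self.stack[-1] if self.stack else None
            self.segments.append((context, data.strip()))

parser = TextWithContext()
parser.feed('<div class="mat"><h1>Garden chair</h1><p>Teak &amp; steel</p></div>')
parser.close()

for context, text in parser.segments:
    print(context, "|", text)
```

The parser resolves the entity into a plain ampersand and knows whether a piece of text sits in a heading or a paragraph. That is the same kind of context the new embedded content processor passes on to the translator.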
Sharing custom filetypes with others
I can’t leave this section without mentioning how you share your custom filetypes with others. This is done by exporting your settings, but now with Studio 2014 SP1 you have two lots of settings to share.
This is the settings file for the custom XML filetype you created. To export/import these files you click on the File Types node in your options and then select the specific filetype you wish to export from the list that now appears on the right. This will activate the Import/Export Settings… buttons:
This is the settings file for the custom Embedded Content Processor you created. To export/import these files you click on the Embedded Content Processors node in your options and then select the specific filetype you wish to export from the list that now appears on the right. This will activate the Import/Export Settings… buttons exactly as for above.
This is actually another good reason to always create a copy of your default Embedded Content Processor because if you are sharing custom XML filetypes with a colleague then they may get unexpected results if you used the default HTML filetype and when your colleague used it the settings were different because they had a customised HTML filetype for example.
Checking your work!
At this point I think I’ve covered enough for you to get started and have a play. But seeing as I’ve written all of this I just wanted to mention Pseudotranslate and how this can help you to make sure your filetype is extracting everything you want, or possibly too much. Once you have completed your filetype to the best of your knowledge, it’s worth opening it quickly with the Translate Single Document approach and without a Translation Memory. Now run the Pseudotranslate batch task with these options:
When complete you will see that the target column of your file in Studio is now full of question marks, so these stick out like a sore thumb! Save the target file and inspect the result with a text editor:
If you missed anything out that should have been translatable text it’s much easier to spot it in here and you can refine your filetype until it’s ready for production. But this looks good to go, as the only recognisable text that is between elements is the product code in the product element, and I deliberately excluded this with my custom filetype!
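If you have a lot of files to check, the same inspection can be roughed out in a script by comparing the source file with the pseudotranslated target and listing every string that came through unchanged. The Python sketch below makes no assumption about what the pseudotranslated text looks like. It only compares values, the two sample strings are invented, and for brevity it ignores text that follows a child element.

```python
import xml.etree.ElementTree as ET

def untouched_strings(source_xml, target_xml):
    """List element text and attribute values identical in source and target."""
    def strings(xml_text):
        found = []
        for elem in ET.fromstring(xml_text).iter():
            if elem.text and elem.text.strip():
                found.append((elem.tag, elem.text.strip()))
            for name, value in elem.attrib.items():
                found.append((f"{elem.tag}/@{name}", value))
        return found

    target = set(strings(target_xml))
    return [item for item in strings(source_xml) if item in target]

# Invented source and pseudotranslated target for illustration.
SOURCE = ('<rootelement><product productname="Garden chair">GC-001'
          '<productcatalogentry><![CDATA[<p>Teak</p>]]></productcatalogentry>'
          '</product></rootelement>')
TARGET = ('<rootelement><product productname="?????? ?????">GC-001'
          '<productcatalogentry><![CDATA[<p>????</p>]]></productcatalogentry>'
          '</product></rootelement>')

leftovers = untouched_strings(SOURCE, TARGET)
print(leftovers)
```

Anything in the list was not extracted for translation. You then decide whether that was deliberate, as with the product code here, or something the filetype missed.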