Why do we need custom XML filetypes?

My son, Cameron László, asked me how my day had gone and before I could answer he said in a slightly mocking tone “blah blah blah… XML… blah… XML … blah blah”.  Clearly I spend too much time outside of work talking about work, and clearly his perception of what I do is tainted towards the more technical aspects I like the most!  Aside from the note to self “stop talking about this stuff after I leave the office!” it got me thinking about why I probably think about XML as much as I apparently do and how I could help others avoid the very same compulsion!  I’ve written articles in the past about how to use regular expressions in Studio, and an article on using XPath, and I’ve probably touched on handling XML files from time to time in various articles.  But I don’t think I’ve ever explained how to create an XML filetype in the first place, or why you would want to… after all Studio has default filetypes for XML and this is just another filetype that the CAT tool should be able to handle… right?

Wrong!  Well partly wrong anyway.  If the XML is simple then the default filetypes will probably handle the file perfectly well. But what makes XML unique compared to most other filetypes is that the translatable text could be hidden in user defined locations, and Studio (or any CAT tool for that matter) does not necessarily know where it is, or that it should be translated, without you providing some additional information.

But before I dive in I think it might be helpful to understand a little of the terminology here, not everything you’ll ever need to know about XML as this is a pretty big subject and I’m still learning myself, but rather a few simple things that are relevant to knowing how to extract translatable text.  Take this short XML file as an example:

05

The first red line <grandparent> is called an element.  In this case it’s an element I decided to call “grandparent”.  In fact this element is a special one because it’s also the first and last element in the file.  So this is also called the “root element” and it’s important to note this because we can use the root element as one way to automatically decide which filetype to use when opening an XML file in Studio. There are two more elements in my file; “parent” and “child”.  I deliberately used these names because this nesting of the elements inside one another is important.  Here we have one element called “parent” which is a child element of the “grandparent” and it contains the translatable text “I’m the parent and I have two children”. Inside this “parent” element I have two “child” elements and they also contain an attribute.  The translatable text is in black, “My son is called George.”, and “My daughter is called Sally.”  The attributes are providing more detailed information about the children, so in this case defining whether they are male or female. As a general rule, the information I have just provided is how we would like things to be in an ideal world.  But that would be too easy and in reality there are no real rules to say when you should use an element and when you should use an attribute!  In practice this is just how we would like to see them in the translation industry and if they come like this then the default XML filetype, called AnyXML in Studio, will often suffice.  But we want to look at the real world!
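For anyone who prefers to see the terminology in action, here is a minimal sketch in Python that rebuilds the sample file from the description above and prints the root element, each element, its attributes and its text. The attribute name "gender" and its values are my assumption, since only the screenshot showed the real ones.

```python
import xml.etree.ElementTree as ET

# Reconstruction of the sample file described above; the attribute name
# "gender" and its values are assumptions, not taken from the original image.
SAMPLE = """<grandparent>
  <parent>I'm the parent and I have two children
    <child gender="male">My son is called George.</child>
    <child gender="female">My daughter is called Sally.</child>
  </parent>
</grandparent>"""

root = ET.fromstring(SAMPLE)

# The root element is the outermost element of the document.
print("root element:", root.tag)

# Walk every element, showing its name, attributes and own text.
for element in root.iter():
    text = (element.text or "").strip()
    print(element.tag, element.attrib, repr(text))
```

The nesting is what matters here: "child" sits inside "parent", which sits inside "grandparent", and the attribute travels with the element it belongs to rather than with the text.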

Why use custom XML filetypes?

So let’s take a look at a couple of files, starting with an XML file that looks like this and following the simplistic logic I explained above:

02_simple_xml

This is quite straightforward, every piece of text that looks as though it’s translatable is inside an element.  If I open this in Studio without creating a custom XML filetype it looks like this – I’m using the TagID mode to display tags so the tags are numbered and the orange tab at the top is displaying the name of the filetype that is being used to open the file for translation:

02_simple

So here, the Any XML filetype (the default in Studio) does a pretty good job and even manages to determine what is likely to be inline tags versus external tags, so the text flows quite nicely making it easy for the translator.  For this file you could even copy source to target for things like segment #2 which is not really translatable text.  So pretty simple and definitely usable… But now let’s take an XML file like this which contains exactly the same content but has been prepared in another, equally valid, XML way:

03_complex_xml

This one is trickier because some of the translatable text is now in an attribute rather than an element (the productname attribute for example), and the default AnyXML filetype does not extract text from an attribute.  We might not consider it to be good practice to use an attribute for this, but in practice we all know it’s pretty commonplace and there can be good reasons for the file to be constructed this way.  In addition to this much of the text is provided in the file by using a CDATA section.  This is normally used for part of the file that should not be parsed at all with an XML parser, often including characters that would be illegal in XML.  This can be a complete html file that is embedded into the XML, or part of one as I have done here, or even something else based on a custom script written by a developer.  So there is no single way to handle all embedded content.  If I open this file with the default AnyXML filetype then it looks like this where we see html entities, opening and closing tags (<> for example) and the name of these tags (div class=”mat” for example), all of which you would not want to have to try and translate around:

03_complex_default

Yuck!  Not very nice at all because not only is it parsing the html code inside the CDATA section as translatable text without any kind of tag protection at all, it’s also missing the product name at the start because in this file it was stored in an attribute as opposed to an element.  So what we really need is a custom XML filetype that can deal with the specific nuances of this particular XML file. The release of Studio 2014 SP1 provides a neat way of dealing with the CDATA or any other form of embedded code, but the basic principle of how to create a custom XML filetype in Studio is the same now as it has been since the release of Studio 2009.  At the risk of this being a ridiculously long post let’s take a look at how this is done using the second, and more complicated XML example… at least we’ll look at how I normally tackle it as there are other ways.
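To see why a generic XML parser struggles with a file like this, here is a small Python sketch. The element and attribute names (rootelement, product, productname, productcatalogentry) follow the file used in this article, but the content is invented for illustration and is not the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the second file: element and attribute names
# follow the article, but the content itself is invented for illustration.
SAMPLE = """<rootelement>
  <product productname="Garden chair">AB-123
    <productcatalogentry><![CDATA[<div class="mat">Made of <b>oak</b> &amp; steel</div>]]></productcatalogentry>
  </product>
</rootelement>"""

root = ET.fromstring(SAMPLE)
product = root.find("product")
entry = product.find("productcatalogentry")

# Text in an attribute is not part of any element's text content.
print("attribute:", product.get("productname"))

# The XML parser hands back CDATA content untouched: markup and entities
# arrive as plain characters rather than as tags.
print("CDATA text:", entry.text)
```

The second print shows the whole html fragment, angle brackets and entity included, as one plain string. That is essentially what ends up in front of the translator unless something else takes care of the embedded content.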

A general note though.  I won’t be covering every single thing about XML file types in Studio either.  So if you have a question I didn’t address in this post please refer to the online help which is pretty useful, or post your question into the comments below so we can build a useful reference article for anyone else.

First Steps

My first step in creating a custom XML filetype in Studio is to import the XML file so I have all the elements and attributes in the XML available to me for selection as I create the custom rules.  To do this you go to the File Types node in your options (not forgetting there are differences between the general Options and Project Settings as usual) and click on New which will bring up the Select Type dialog box:

06

If you’ve done this before in any version of Studio prior to Studio 2014 SP1 you will note that there are now two options for XML.

  1. XML (Legacy Embedded Content), and
  2. XML (Embedded Content)

I mention them in this order because the XML (Legacy Embedded Content) is the same as it was in previous versions.  XML (Embedded Content) is the approach that is new in Studio 2014 SP1 and I’ll cover this after discussing the old one briefly as I go through the steps.  The next steps will be the same irrespective of which of the two XML types you choose.

File Type Information

First you select the XML filetype you want and click on OK.  This brings up the File Type Information dialogue box where the only two things I normally change are the File type name and the File type identifier:

07

I change these because it makes it easier for me when I’m working to see the name of the filetype I created in the list of available filetypes, and also when I open the file and change the tag display to TagID mode the orange tab at the start and end of each file will display the name of the filetype too.  Because I generally create XML filetypes to help other users I find it useful to easily distinguish the names in this way.  Then I click on Next >.

XML Settings Import

This takes me to the XML Settings Import dialogue where I typically select my XML file so that all the elements and attributes are added to my filetype to make it easier for me to create the rules:

08

I browse to my XML file, or one that is representative of a batch of XML files, and after it’s selected as shown above click on Next >.

Parser Rules

I now see all the elements that have been identified in the file, and they are listed like this based on whatever defaults Studio believes the tags should be:

09

At this point I would normally just click on Next > and would address the rules in detail after the filetype was created, but to keep this simple and ensure the article flows logically I’ll make the changes to the parser rules now.  But it’s worth noting that if you miss something it doesn’t matter as it’s simple to make changes later on.

If you look at the XML example above, the one with the CDATA as this is the XML we are addressing here, you can see that what we want to extract for translation is the content of the productname attribute and the productcatalogentry element.  The rest I don’t want.  So first I remove, or disable, the rootelement and the product.  Then I add in two rules… you’ll see the Add…, Edit… etc toolbar becomes active for more options when you select a rule.  My Parser Rules now look like this:

10

I created the //* by selecting the XPATH option in the Add Rule dialog box (read more on XPath in Studio).  This is basically a wildcard where the star simply means select everything, and I made that Not translatable.  I did this because the first thing I want is to make sure I get nothing at all parsed into my file, and then I can bring in the information I specifically ask for, which in this case is the two rules above that.  This is not essential, but I was shown this when I first learned from the Master, Patrik Mazanek, and the habit stuck!

The productcatalogentry element was already there, I just changed the Translate property to Always Translate by editing the rule.  I did this as a matter of course because the default is Translatable (except in protected content) and I want to be sure that the content of this element will always be extracted even if its parent element is not.  Plus of course I wanted to explain this concept that could be the reason for text not being parsed if you set a parent element to Not translatable.

I could have created the //product/@productname rule using XPath too, but because I imported the file into Studio earlier on as part of these steps it’s easier to let Studio do this for me.  So I just Add… a new rule and select the Rule type Attribute, then select the element containing the attribute I want to narrow it down (a large XML file could contain a huge list, and sometimes with overlapping attribute values):

11

I set this to Always Translatable as well and then click Next >.
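The logic behind these three rules can be sketched in a few lines of Python. This is only a simplified model of the idea (exclude everything, then bring back what you ask for) and not how Studio implements its parser; the sample content and the names ALWAYS_TRANSLATE_ELEMENTS, ALWAYS_TRANSLATE_ATTRIBUTES and extract are made up for the sketch.

```python
import xml.etree.ElementTree as ET

# Invented sample using the element and attribute names from the article.
SAMPLE = """<rootelement>
  <product productname="Garden chair">AB-123
    <productcatalogentry><![CDATA[<p>Solid oak.</p>]]></productcatalogentry>
  </product>
</rootelement>"""

# Simplified model of the three rules: everything is excluded by default
# (the //* rule), then two explicit rules bring content back in.
ALWAYS_TRANSLATE_ELEMENTS = {"productcatalogentry"}
ALWAYS_TRANSLATE_ATTRIBUTES = {("product", "productname")}


def extract(xml_text):
    """Return the strings the simplified rules would expose for translation."""
    segments = []
    for element in ET.fromstring(xml_text).iter():
        # Attribute rule, the equivalent of //product/@productname.
        for name, value in element.attrib.items():
            if (element.tag, name) in ALWAYS_TRANSLATE_ATTRIBUTES:
                segments.append(value)
        # Element rule; any element not listed falls through to //*.
        if element.tag in ALWAYS_TRANSLATE_ELEMENTS and element.text:
            segments.append(element.text.strip())
    return segments


print(extract(SAMPLE))
```

The product code never makes it into the list because nothing asks for it, while the attribute value and the CDATA content both do. The CDATA content is still raw html at this point, which is the next thing to deal with.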

File Detection

I’m now brought into the File Detection dialog that provides me with a number of different ways to recognise the XML file I am opening.  This is very important because as you create more and more it’s quite easy for Studio to use the wrong filetype for a particular file and if you didn’t notice (also remember why I always change the identifiers at the start) you may find your translated file coming back partially translated or parsing information that should not be at risk of change at all.  This is particularly so when handling XML files from the same customer as they may well use the same root element but have different schemas for example.  In this case, let’s keep it simple and just use the root element as the criteria for recognising my file:

12

My root element was actually called rootelement and as you can see in the image it is already populated because I imported the XML into my filetype at the start.  So all pretty straightforward… and at this point I can click on Finish… and that’s it.  My custom filetype is complete… almost!
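Detection by root element is easy to picture in code. The sketch below is my own illustration in Python and not Studio's mechanism: it reads only as far as the first start tag and looks the name up in a table. The filetype identifiers in FILETYPE_BY_ROOT and DEFAULT_FILETYPE are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical mapping from root element name to a filetype identifier;
# both identifiers are made up for this sketch.
FILETYPE_BY_ROOT = {"rootelement": "Product Catalogue XML v1"}
DEFAULT_FILETYPE = "Any XML"


def detect_filetype(source):
    """Pick a filetype by reading only as far as the first start tag."""
    for _event, element in ET.iterparse(source, events=("start",)):
        # The first start event always belongs to the root element.
        return FILETYPE_BY_ROOT.get(element.tag, DEFAULT_FILETYPE)
    return DEFAULT_FILETYPE


print(detect_filetype(io.BytesIO(b"<rootelement><product/></rootelement>")))
print(detect_filetype(io.BytesIO(b"<grandparent><parent/></grandparent>")))
```

It also shows the weakness mentioned above: two files from the same customer with the same root element but different schemas would land on the same filetype, so the root element alone is not always enough.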

13

My attribute is being parsed this time so segment #1 contains the translatable text from the attribute value (note that this is also annotated as a TAG in the document structure column on the right because it is an attribute and not an element), and I have not got the product code which was extracted with the default AnyXML.  So I’m nearly there.  All I have to do now is tackle this embedded HTML in the CDATA section.

There are two ways to do this depending on which XML filetype you created, but I’m mostly interested in the new way with Studio 2014 SP1.  However, I’ll take a brief look at the Legacy Embedded Content first.

XML (Legacy Embedded Content)

In both the Legacy and the new method for handling Embedded Content you have to first enable it.  So for this legacy filetype I would do this and check the box:

14

When I do this I can now add the Document Structure I want to be handled with the Embedded content processor.  At this point you may well be asking what do I mean by this?  Well, take the file we have so far.  The right hand column, the one that appears when you open the file, contains this information, and you can expand it by clicking on it:

15

The Code you see here is the code you need to add into the list for any text that contains embedded content that you wish to treat with the embedded content processor.  In this example TAG is the code, but I don’t want to use that one as there is no embedded content in this segment.  It’s also worth noting that if there were I could not handle embedded content inside an attribute anyway… hopefully most users will never come across anything so poorly written as that!  There is embedded content in the next segment, however, as this is the CDATA Section.  Now, because these two types of Document Structure are also Studio codes and not custom codes that I created myself I can use the Location (Tag Content in the example above) to identify it in the list.  So I actually want the CDATA Section which I can select like this:

16

Once I’ve done that I need to create my Tag definition rules.  Now this will be a similar process to the way you handle embedded content in a Microsoft Excel file which I wrote about in “Handling taggy Excel files in Studio…“.  So I won’t write a lot more on this process for Legacy XML filetypes.  Suffice it to say the finished rules for my filetype might look something like this:

17

These take a while to create, are pretty rough and the finished article, whilst better than the version produced by the AnyXML filetype, does still leave a bit to be desired.  I could spend time working on this to make it more user friendly, but even after all of this it would only take a file to be provided that contained different markup and I might have to start changing the rules again.  Using these rules I get this which protects things I wanted protected but also doesn’t really make for simple translating because everything is a tag including the entities, and the translator will have no idea about the context of the text because the embedded content rules with this method cannot hold Document Structure Information of their own.  I would not say this has no place however, because there are some files where the ability to be able to use regular expressions to protect tags, and text you don’t want to be translated is a real plus.  But there is a better way!

18
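To give a feel for what regex-based protection does, here is a short Python sketch. It does not use the rules from my screenshot; it uses the two catch-all patterns a reader suggests in the comments under this post, and the sample string is invented.

```python
import re

# The two catch-all patterns quoted from the reader comment: one for any
# tag, one for numeric character references.
TAG_OR_ENTITY = re.compile(r"<[^<]+?>|&#\d+?;")

# Invented sample of embedded HTML as it would appear inside CDATA.
embedded = '<div class="mat">Caf&#233; table <b>oak</b></div>'

# Split the string into protected placeables and translatable text runs.
placeables = TAG_OR_ENTITY.findall(embedded)
text_runs = [run for run in TAG_OR_ENTITY.split(embedded) if run.strip()]

print(placeables)
print(text_runs)
```

Everything matched becomes a placeable, including the entity in the middle of a word, so the translator sees "Caf", a tag, then " table ". That is the "everything is a tag including the entities" problem in miniature.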

Ziad Chama also recorded an excellent webinar that is freely available called “How to create an XML File Type in SDL Trados Studio 2014” which goes through the process in detail.  I’d thoroughly recommend you watch this if you have any interest in creating XML filetypes in Studio as it is very informative and Ziad is a real expert.  It covers Studio 2014 prior to the release of SP1 which introduced a new method, so that’s what I’ll cover next.

You can also find a handy knowledgebase article here that is straight to the point!

XML (Embedded Content)

But now let’s take a look at the new method in Studio 2014 SP1.  The first steps are exactly the same, but when you get to the embedded content section this is where you’ll notice the difference.  It looks like this:

19

So two new things:

  1. There is a drop down box that seems to refer to completely different filetypes
  2. You can decide whether to apply the embedded content processing to CDATA Sections or any other named Document Structure Information (as before with the legacy filetype)

Selecting the embedded content processor to use

Let’s tackle point one first.  This is a drop down box that refers to completely different filetypes.  So you will probably already see that the concept here is to use a filetype within a filetype rather than have to create the regular expressions to handle the content as we did with the legacy embedded content processor:

21

The defaults are the regular expression filetype, and the two HTML filetypes that come with Studio out of the box.  But you can add your own which makes it possible to configure one of these filetypes so it does not use the defaults and then have different embedded content processors depending on the content of the work you are doing.  So if I collapse my navigation menu I now see this in my options:

22

Expanding this allows me to take a copy of one of the three defaults and then configure it as I see fit.  I don’t really have to do this for the simple complex example I have used for this article, but this is how you would do it!  You click on the Embedded Content Processor node and you’ll see the three available filetypes.  Select the one you are interested in; so in my case I picked the HTML 5 filetype, and then click on Copy…:

23

You get a small dialogue box where you can change the name of the File type and the File type identifier as before, and pay attention to the name because you cannot use the same identifier as you did for the main filetype as duplicate file type IDs are never allowed.  Then click on OK and close the Options.  You need to close them because if you don’t the list won’t be refreshed (a little issue I’m sure that will get resolved in a future release!) and then when you open the options again and go to the Embedded Content node of the new XML filetype you created you will be able to select your new filetype as an embedded content processor like this:

24

Identify where the embedded content is found?

This brings us onto my second point which is that you can decide whether to apply the embedded content processing to CDATA Sections or any other named Document Structure Information (as before with the legacy filetype).  If the embedded content is in a CDATA section which is probably the most common use case then now you do nothing more than check the CDATA sections checkbox as shown in the introduction to this part of the article.  I can then open the file and see this without having to do any additional work at all:

25

Much better… and easier!  Because I’m also still in TagID mode you can see the name of the files which are being used in the orange tabs, and the embedded content processor displays the correct Document Structure Information for this filetype which adds additional context for the translator.  You’ll also note that the entity values are correctly transposed so I don’t have to deal with them as tags.

If the embedded content was in another type of Document Structure then it works in exactly the same way.  You select the appropriate code and that’s it.  No need to add a bunch of regex rules in here.
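The "filetype within a filetype" idea can be illustrated with Python's standard html.parser module standing in for the HTML filetype. This is only an analogy under my own assumptions, not Studio's embedded content processor; the TextCollector class and the sample fragment are invented for the sketch.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text runs and img alt text from an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &#233; before handing text to handle_data.
        super().__init__(convert_charrefs=True)
        self.segments = []

    def handle_starttag(self, tag, attrs):
        # Pull translatable alt text out of image tags.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.segments.append(alt)

    def handle_data(self, data):
        # Keep only text runs that contain something visible.
        if data.strip():
            self.segments.append(data.strip())


collector = TextCollector()
collector.feed('<div class="mat"><p>Caf&#233; table</p><img src="t.png" alt="Oak table"/></div>')
collector.close()
print(collector.segments)
```

Because a real HTML parser is doing the work, the entity is converted to the character it represents and the alt text is picked up, with no regular expressions to maintain.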

Sharing custom filetypes with others

I can’t leave this section without mentioning how you share your custom filetypes with others.  This is done by exporting your settings, but now with Studio 2014 SP1 you have two lots of settings to share.

*.sdlftsettings

This is the settings file for the custom XML filetype you created.  To export/import these files you click on the File Types node in your options and then select the specific filetype you wish to export from the list that now appears on the right.  This will activate the Import/Export Settings… buttons:

28

*.sdlecsettings

This is the settings file for the custom Embedded Content Processor you created.  To export/import these files you click on the Embedded Content Processors node in your options and then select the specific filetype you wish to export from the list that now appears on the right.  This will activate the Import/Export Settings… buttons exactly as for above.

This is actually another good reason to always create a copy of your default Embedded Content Processor because if you are sharing custom XML files with a colleague then they may get unexpected results if you used the default HTML file and when your colleague used it the settings were different because they had a customised HTML filetype for example.

Checking your work!

At this point I think I’ve covered enough for you to get started and have a play.  But seeing as I’ve written all of this I just wanted to mention Pseudotranslate and how this can help you to make sure your filetype is extracting everything you want, or possibly too much.  Once you have completed your filetype to the best of your knowledge, it’s worth opening it quickly with the Translate Single Document approach and without a Translation Memory.  Now run the Pseudotranslate batch task with these options:

26

When complete you will see that the target column of your file in Studio is now full of question marks, so these stick out like a sore thumb!  Save the target file and inspect the result with a text editor:

27

If you missed anything out that should have been translatable text it’s much easier to spot it in here and you can refine your filetype until it’s ready for production.  But this looks good to go, as the only recognisable text that is between elements is the product code in the product element, and I deliberately excluded this with my custom filetype!
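If the target file is large, eyeballing it gets tedious, so a small script can do the looking for you. The Python sketch below assumes that pseudo-translated text shows up as question marks, as in my screenshot, and simply lists any element text or attribute value that still contains plain letters. The sample target file and the helper name untranslated_strings are invented.

```python
import re
import xml.etree.ElementTree as ET

# Invented example of a saved target file after pseudo-translation; the
# product code was deliberately left out of the filetype rules.
TARGET = """<rootelement>
  <product productname="?????? ?????">AB-123
    <productcatalogentry><![CDATA[<p>????? ???.</p>]]></productcatalogentry>
  </product>
</rootelement>"""

MARKUP = re.compile(r"<[^<]+?>")
LETTERS = re.compile(r"[A-Za-z]")


def untranslated_strings(xml_text):
    """List text and attribute values that still contain plain letters."""
    found = []
    for element in ET.fromstring(xml_text).iter():
        candidates = [element.text or ""] + list(element.attrib.values())
        for value in candidates:
            # Ignore embedded markup so only visible text is checked.
            visible = MARKUP.sub("", value).strip()
            if LETTERS.search(visible):
                found.append((element.tag, visible))
    return found


print(untranslated_strings(TARGET))
```

Anything in the list is either something you excluded on purpose, like the product code here, or something the filetype missed.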

THE END!! 

41 comments
  1. Gee, Paul, I can’t thank you enough for including a “get out quick” button from this article — I availed myself of the opportunity immediately and ended up in a close digital approximation of Nirvana. Have nothing to say about the rest of the text, but loved the unexpected bonus of calm.
    Truly, thanks. ;-)
    Zakiya

    • Excellent… with any luck this post will have a little something for everyone!

  2. Juraj said:

    Paul, the CDATA regex are terrifying, for this file you need just this two :)

    <[^<]+?>
    &#\d+?;

    • Hi Juraj, of course there are always more than one way to skin a cat ;-) If you are happy to have every tag as a placeable, and not to extract the alt text on the image then these two rules would do it. Certainly catch all expressions have their benefits. But if you wanted to try and provide a better tag handling experience, have formatting on the extracted text or even different segmentation depending on the context of the tags, then you would need to handle them as separate expressions.
      However, the point I’m making is using regex for this is sometimes terrifying indeed!

  3. Alison Field said:

    Excellent!!!

  4. André said:

    Wow thanks a lot for this long but very interesting post. I’ll be able to use Studio in a better way for my next XML+CDATA project. If only you had posted it 1 month sooner… it would have saved me a few hours :-)

  5. walkqisky said:

    Nice shoot, thank you for clarifying the DSI stuff, and it will be much better if a thorough DSI explanation is provided, e.g. what do “Block Quote” and “Callout” use for in the screenshot of the section “XML (Legacy Embedded Content)”? Any links to refer to?

    • We don’t have any links for these as they are just a list of simple and common structures we recognise in other filetypes. So “Block Quote” would be referring to a block of text set off from the main text. Common in html and xml. The same for “Callout” which would be a block of text used to annotate an image perhaps, or something like that. You don’t have to use the correct ones for the correct situation, and you could create your own custom ones. But if you were familiar with the creation of html or xml files, then you would probably be familiar with these terms.
      But I kind of agree with you that it would be handy to have a list somewhere. I did come across this one for docbook which explains some, and if I find time I might validate a list of the ones we use as standard and see if we can publish them in a KB or something.

      • walkqisky said:

        Thank you Paul, very suggestive, and reminds me of your blog on taggy excel which used the “cell” thing, hopefully a list could be summarized out for different DSI usage.

  6. Manicle said:

    Hi Paul,

    I will try to make it clear. In the XML file type, there is an entity conversion function (used in my case to convert the html representation of accented letters into a readable format and to restore it once the translation is done). Could you please confirm that, with the legacy mode, this conversion does not occur for content in CDATA sections ? However, it seems to be OK since version 2014 SP1 thanks to the new Embedded content processor.

    • Hi Nicolas, with the previous version this did not work because you only had regex at your disposal. So you could convert to tags, but not do a proper entity conversion. In Studio 2014 SP1 with the new processor this works fine for most cases.

  7. Juraj said:

    Hi Paul,

    One more question, I’m trying to use this new Embedded Content Processors on one file. I wanna lock a specific text (mod_1364893518685_2901.xml) using the “Embedded Content Plain Text v 1.0.0.0″ I have created a new one, set the Inline Tags, then in the newly created file type I have set the Document structure for the “body” but I can’t get any reasonable output. It’s basically still the same.

    • Kind of tricky to answer a question like this Juraj. Send me the file so I have some idea of what you are doing and what you are looking at. These comments are probably not the best place for a question like this without any qualification at all.

  8. Daniel McCosh said:

    Hi Paul,
    Thanks so much for putting all this together! Before I reinvent the wheel, I wonder if anyone here has had any experience translating wordpress content as xliff or xml. We have a multisite wordpress installation with the multilingual press plugin installed for German and English and our developers have created a custom xliff export function (which unfortunately exports xliff 1.0). I’ve set up a custom xml file type to just enable the target element from the xliff file for translation (that works) and I’m trying to convert into an internal tag and deal with short codes with translatable attributes e.g.
    [/collapse]

    [collapse title=”Philosophische Fakultät und Fachbereich Theologie” color=”gold”].
    I’m not really sure of the best approach and I’d really welcome any tips or pointers. Should I be using the Embedded Content Plain Text processor for this?
    Thanks,
    Daniel

    • walkqisky said:

      Hi Daniel, I encountered the same request for translating wordpress xliff files generated by a multilingual plugin wpml, and the solution for this scenario is Rainbow, an open source toolkit on the basis of OKAPI framework, and the wordpress xliff file could be properly handled by the filter XML stream, then work the file out using Tageditor or some other xliff editors rather than Studio, and convert back. Hope this helps.
      @Paul:
      Looking forward Studio could publish a new file filter to properly handle wordpress xliff file or alike thing.

      • Hi, I think this is a good opportunity for an openexchange filetype for this flavour of XLIFF, if anyone is interested to create one? Or I could find a developer for this if you guys wanted to share the cost?

      • Daniel McCosh said:

        Hi,
        We moved away from WPML because of a bug in an earlier version with the language switcher but we export XLIFFs in a similar way. For anybody looking to process XLIFFs exported by WPML it’s confusing because there seems to be some out of date information on wpml.org relevant to older Trados and Studio versions. We use Studio 2014 SP1 but when we used Studio 2011, trying to open the XLIFF files came up with an error message about the DTD due to the older version of XLIFF. I don’t think you really need Rainbow – you can just create a new file type based on the xml file type and create rules for embedded content in Studio or an ini file in TagEditor. WPML converts line breaks in the wordpress post to which makes for lots of tags in some posts… and these tags should be protected in your custom file type, it seems like WPML support aren’t really aware that these can be protected in Studio (http://wpml.org/forums/topic/sdl-trados-studio-tag-issues-with-xliff-0-9-4/). Unfortunately, I’m pressed for time at the moment but if my file type helps anyone using WordPress and Studio, feel free to download it as an example from: http://mccosh.de/wpml.sdlftsettings (I just made the content within the element translatable and added a few example rules to deal with the line breaks and an example shortcode with a translatable attribute). I know it’s good practice to add //* into the XML parser rules but I kept accidentally making the CDATA untranslatable and ran out of time to work it out myself (I also don’t have a budget…). If we improve these settings, I’ll submit it to OpenExchange. Hope you enjoyed your film, Paul ;)
        Regards,
        Daniel

    • Jerzy Czopik said:

      Hi Daniel
      Do you still have the wordpress settings?
      We encountered a WP file and I tried to use HTML5 for CDATA and excluding some elements from translation with a little of success, but would be quite interested to see your results and learn from them.
      MTIA and best regards, Jerzy

  9. Daniel McCosh said:

    Hi again,
    Sorry, that was quite an involved question. I managed to find a solution by reading your other posts on embedded content and testing with RegExBuddy etc.
    Best,
    Daniel

    • Thanks for updating the post Daniel… this was something I saved for my flight back home this afternoon! Now I can watch a movie :-)

  10. walkqisky said:

    Hi Paul,
    I have a question on the Embedded Content Processor (ECP), it seems I have to create a new filetype to apply the ECP other than to any of the existing filetypes. So there will be no big help if I need to process xlsx, xliff or ttx file with a lot of CDATA content in it, and a bunch of regex is still needed to write by hand, am I correct?
    I think SDL Trados users will be happy until the ECP extends its applying range, and it is indeed a half-product by now.

    • I wouldn’t call it a half product, but there is definitely room for further enhancements to make this easier. The new embedded content processor certainly makes it more comfortable for XML filetypes containing embedded content, and the plan is to extend this capability to more, but the biggest improvement will come when you can apply this principle to any filetype including the ones that currently cannot handle embedded content at all… such as the two you mention (XLIFF and TTX). TTX should probably be handled prior to becoming a TTX, but it would be useful to be able to make up for inadequately prepped TTX files too.

  11. korapi said:

    Hello Paul,

    I work with WordPress posts exported as XML files. The outcome is an embedded file with HTML 5 code. I’ve followed your instructions on the new procedure in 2014 SP1 and it works like a charm! Well, almost…
    There’s a block of code that Trados doesn’t extract correctly:

    [caption id="attachment_477820" align="alignnone" width="984"]<img class=" wp-image-477820 " src="http://whatever.png" alt="blah blah blah." width="984" height="380" /> BLAH BLAH BLAH [/caption]

    Actually the problem is only on this part, which Trados thinks is translatable text:
    [caption id="attachment_477820" align="alignnone" width="984"] [/caption]

    I guess I can solve this problem if I tweak something in the parser rules section of the embedded content processor of my HTML5 filetype, but I have no clue how to do it.

    Could you give me a little hint?

    • Hi, the problem you’ll have is that if you use the HTML filetype for embedded content then you cannot use regex to handle the script. So to be complete you will either have to use the legacy embedded content and define rules for everything, or use the SDK and create a filetype specifically for these files. The latter is probably a better idea but it does require developer skills.
      Incidentally, to display the square brackets you need to write them as entity values.

      • korapi said:

        I´ve tried the legacy way but it´s really sickening because there´s always something wrong when opening the file in Trados and I´ve no idea about the sdk. I thought I could tweak something in the parser rules of the embedded content processor.

        Thanks very much anyway :)

      • The legacy approach is tricky because you have to be very careful about not creating duplicate rules that conflict with each other. I’m pretty sure there will be further improvements to the embedded content processor and also to the ability of the HTML filter to handle scripts. That’s the problem really I think… the square brackets are not being handled as scripts.

      • korapi said:

        By the way, I´ve also two more questions regarding SDL Trados as I´m really new to it:

        Do you know what I can do to get a style sheet to preview the content of this kind of file? Trados tells me there´s no preview for the custom file I created.

        Is there something I can do to make the preview window show up quicker? When I click on preview it moves so slowwwwlyyyy..

      • On the preview… you have to create a custom stylesheet. I wrote an article on how to do this here : Stylesheets
        On the slow preview window… not sure. Maybe dock it on a separate screen (if you use two) and then you won’t have to use it at all in slide in slide out mode.

  12. korapi said:

    Hello Paul, I’m still struggling with XML embedded content. I gave the legacy way (using regular expressions) another try and this time I managed to get a clean translatable file, but I still have a problem:

    Every time Trados finds an HTML tag, it “breaks” the sentence and creates another segment. This way, it shows some sentences divided into two, three or more segments. Do you know if there’s a way to prevent this?

    • Hi, my guess is you are using some generic catch-all expressions in addition to more specific ones. The problem with this is that you often duplicate rules and this can cause the effect you describe. So you need to be very careful, and if you want a catch-all in addition to a few specific rules then you need to make sure you exclude the specific ones in the catch-all expression. For example, if you wanted a catch-all tag pair expression that excluded br tags, then you could use something like this as the opening expression:

      (?:(?!(<br))<[a-z][a-z0-9]*[^<>]*>)

      And this as the closer:

      </.*?>

      Maybe that will give you some ideas that might help?
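
      To see what these two expressions do and do not match before putting them into the filetype settings, a small test harness helps. The sketch below uses Python purely as an illustration; Studio itself uses the .NET regex engine, but these particular patterns behave the same way in both. The sample string and the function names are made up for the example.

```python
import re

# The two expressions from the reply above, copied unchanged.
OPENING = re.compile(r"(?:(?!(<br))<[a-z][a-z0-9]*[^<>]*>)")
CLOSING = re.compile(r"</.*?>")


def opening_tags(text):
    """Return every opening tag the catch-all expression matches."""
    return [m.group(0) for m in OPENING.finditer(text)]


def closing_tags(text):
    """Return every closing tag the closer expression matches."""
    return [m.group(0) for m in CLOSING.finditer(text)]


# Hypothetical embedded content with a tag pair, a br tag and a link.
sample = 'First <b>bold</b> line<br />second <a href="x.html">link</a>'

print(opening_tags(sample))  # the br tag is skipped by the negative lookahead
print(closing_tags(sample))
```

      Note that the lookahead rejects anything starting with "<br", so a tag such as <brand> would be skipped as well. If that matters for your files, the exclusion would need tightening.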

      • korapi said:

        Hi Paul. I´ve tried your suggestion but still getting the same results, sentences “broken” at the beginning of a tag. I think the problem is that I´m not very good at regexp :) If I come up with something positive I´ll let you know.

        Thanks very much for your help.

  13. Jesse Good said:

    Hi Paul,
    JTLYK, if Trados 2014 UI is in Japanese on a Japanese Windows 7 OS,
    an exception will be thrown when you click “Browse” on the “XML Settings Import” screen in your tutorial.
    Here is the description of the exception thrown:
    “Filter string you provided is not valid. The filter string must contain a description of the filter, followed by the vertical bar (|) and the filter pattern. The strings for different filtering options must also be separated by the vertical bar. Example: “Text files (*.txt)|*.txt|All files (*.*)|*.*”

    Also the stack trace points to:
    Sdl.FileTypeSupport.Native.Xml.WinUI.ImportRulesControl._browseButton_Click_1(Object sender, EventArgs e)

    This bug can be worked around by changing the UI to English.

    Thanks

    • Hi Jesse, thanks for letting me know. Best to report these things to support but I’ll send it in in case they don’t already know.
      Regards
      Paul

  14. Tommy Tomolonis said:

    Hi Paul,

    Love the article, but I can’t get this to work on my files. I turned off the Entity conversion because I need &lt; and &gt; to remain as is instead of < and >, but now I would like the content between them to be non-translatable tag pairs. Currently, Studio makes each &lt; and &gt; into individual placeholder tags instead of tag pairs. Furthermore, whatever is between them is translatable (since they’re placeholders). For example, “The cat is black” becomes “The cat is <b>black</b>” where each < and > are placeholders (4 in total) and the “b” and “/b” are translatable text. Is it possible to disable entity conversion for these two entities and then make them into non-translatable tag pairs so that “b” and “/b” cannot be modified? This would make the file much easier to work with.

    Tommy

    • Hello Tommy, can you send me an example file… just a small representative sample? Will be easier for me to answer.

      • Hi Paul and Tommy. I have the same problem with the &lt; and &gt; tags. Were you able to find a solution? Thanks in advance!

  15. Daniel McCosh said:

    Hi Paul,
    Sadly there is a bug in the HTML Embedded Content Processor (SDL reference 47280) which leaves the source text in attributes that have been translated – support say this is due to be patched in September. Just a heads up for anyone who is planning to use this and notices untranslated attributes. Quite unfortunate for us.
    Thanks,
    Daniel

    • Hi Daniel, correct but just to clarify that this is only for attributes in the embedded html and not in the xml file itself.

  16. Similarly, there is a bug (ID 46352) that prevents me from applying the ECP to tag content (in my case, the content of an attribute). I hope to get a fix for that soon…
