Handling taggy Excel files in Studio…

#01By taggy files I mean “embedded xml or html content” that is written into an Excel file alongside translatable text.  In the last article I wrote I documented a method sometimes used by people to handle tagged content in a Word file… funnily enough I came across a Word file containing the XML components of an IDML file today and I guess it must have been prepared in a very similar way judging by the enormous number of tags using the tw4win style to hide them when opened by any SDL Trados version!  Proof for me that this practice is sadly alive and well.  But I digress… because this time I want to cover how to handle a similar problem when you find HTML or XML tagged content in an Excel file.  This crops up quite a bit on ProZ so I thought it might be better to document it once and for all so I have something else to refer to in addition to the Studio help.

Studio uses a concept when creating custom XML files of parsing the file again based on the document structure type of an XML parser rule and replacing patterns you create in the parsed text with tags.  Now let me say that again in English… Studio can look in the content of the text that is extracted for translation and then pick out the bits you don’t want to see and convert them to tags.  So for example, if you had an Excel file that contained things like this:

#02

And then you opened this file in Studio you would see something that looked just like the Excel spreadsheet but what you would probably prefer is what it can be changed into as shown below:

#03

So you want to protect all the angle brackets and text between them.  Just in case you don’t like to see all of this in wysiwyg mode don’t forget that you don’t have to.  You can change the font sizes as shown by Kevin Lossner and Jayne Fox in a neat little video, or you can also select the default to always show you consistent plain text, and all tags (because we know they are really there even in wysiwyg mode!) all the time with this option here… so plenty of choice to suit your preferences:

#04

Of course you also don’t have to convert the plain text excel file into the crazy formatting I showed here!

But the important thing is that we have converted all of the tagged content in the Excel file into protected tags in Studio so that you can safely translate the text alone.  How do you do this… easy!

You just create some rules, using a little regex, to pick out the text that should be tags.  These rules are all added through the Excel filetype settings for XLS and XLSX filetypes in here (the screenshot shows XLSX):

#05

So the process is to first enable the “Embedded Content Processing” in 1. by ticking the box, and then selecting “Cell” from the list of available types.  This is because for Excel the ONLY one that works is “Cell”.  The rest are all part of the available types when you use the same “Embedded Content Processor” in a custom XML filetype, but they have no effect in the Excel filetype.  It makes sense when you think about it as we are dealing with “Cells” in Excel… but it’s not the most intuitive part of this solution.

Once you have enabled the processing you can add your rules as I have in 2.  I was a little flamboyant with them in this case just to show you what could be done if you wanted… I could have converted all of the tags in this file with three rules… maybe less if I was really clever.  In reality, most Excel files I see translators having problems with only contain quite simple XML/HTML and in these cases the first catch all rule below will probably handle the complete file:

TRANSLATABLE TAG PAIR - CATCH ALL
<[a-z][a-z0-9]*[^<>]*>                     </[a-z][a-z0-9]*[^<>]*>
PLACEABLES
{[0-9]}
Alt attribute
<.*alt="                                   ">

The interesting thing is that in my actual example, by getting a little flamboyant I have actually shown how simple it can be because I have just taken the literal text that formed the tags and added these as rules.  For example, I don’t want <b></b> tags to be text.  So I add them in as a translatable tag pair here:

#06

Quite simple when you look at it like this… but the drawback is that you need to add a rule for every type of tag in the file which is what I did to create the colourful view above.  If you have a lot of different tags and it’s a big file (or lots of files) then the slicker regex rule is much better and it may well be all you need to catch all the tags:

#07

Once you have added all your rules, and made them as fancy as you like, you can open the Excel file and all being well you’ll see protected tags, or a fancy wysiwyg format to handle the file.

Just to finish off… the same file displayed using the “no wysiwyg” option I mentioned above will show as follows even if I have set all the fancy rules I did.  The segments that don’t show any tags are like this because the tags are actually at the start and end of the cells, so they are not required.  If I did want to see them (and have to deal with them) this is also possible by changing them to be internal rather than external in the advanced rules as you add the regular expressions:

#08

45 comments
  1. Frank said:

    Very nice and useful write-up, like all the previous ones! Thanks a lot!
    I am just wondering, without reading your great article: How would I have known that I need to select “Cell” from the document structure selection list? I did not find any documentation about the Excel Embedded Content handling in the On-line Help? Wouldn’t it be useful to pre-configure “sdl:cell” for the Excel Embedded Content settings, while leaving Embedded Content processing disabled?
    Please continue writing useful articles like this one!

    • Hi Frank, thanks for your comment and I’m glad you find it helpful. A good question too… I looked all over the place and couldn’t find this mentioned so I have asked our TechPubs to address this in the online help. I guess I came across this problem when we released the feature, asked a developer, and didn’t think about it again until now… my bad! But I agree with you on making it straightforward and I know development have a better solution in mind so this is probably why they have not addressed this specific one. So for now, we can fix the documentation and I hope this article helps as well.

  2. Richard said:

    Hi paulfilkin,

    I am quite bit wondering about creating custom XML file. Can I use the tag <strong> instead of <b>?

    • Hi Richard, the good thing about XML is you can use whatever you like… or whatever the originating system can interpret. So whatever elements are in the XML files you receive then this is what you create rules for. If you wish to add tags not in the file that just make sure that the system creating the XML files will know what they mean when they get the target file back? So if <strong> is the tag used for bold by the originating system then all will be well. But if it’s <b> then it may not be a clever idea.

  3. LEFEVRE Nicolas said:

    Is there an “Embedded content” section planned for XLIFF file types ? It would be very very useful.

  4. I wish there was such an option with xliff files as well. I am trying to import a MemSource xliff file (it works after I renamed the extension) but unfortunately there is no way to protect tags (they look like this, I added extra space so that the be displayed correctly)

    {4}{5 & g t ; None & l t ; 5}{6}

    • There will be a solution to this shortly using an OpenExchange application that creates a nice clean protected xliff for translation.

      • Good to know Paul; please give me a shout when it is available.

      • walkqisky said:

        I am wondering if this app is already online.

      • Which app? This is all about functionality that’s built into Studio.

  5. Malcolm said:

    If you have a number of tags, how do you represent these in Studio’s regex?

    • Well… I guess if the dots are all squashed up as they get when you type an elipse in Excel or Word then use a rule like this (just copy and paste from excel or word straight into the placeholder space):

      But if they are not then this I guess:
      \.{3}
      Both work for me depending on how the elipse is put together.

  6. HI Paul,
    Following your instructions I used the start and end tags “{” and “}” which did the trick. However, when translating, the tags appear like this in target segment http://imageshack.com/a/img839/2795/9tbh.png when using Microsoft MT. I wonder whether there is a way to avoid that.

    • For the time being I am using a “low-tech” way: Paste Excel file to Word, apply internal style to tags, import to Studio. That way I have no issues with MT and tags getting messed up.

  7. lydiaer said:

    I’m trying to use these settings (the three catch all ones) for a new xml file type to handle embedded html. I seem to get an error everytime I prepare a file (object not set to an instance…) but it goes away when I restart SDL or go to a different project then back again – are there limitations to these regex?

    • Don’t forget that regex is designed to suite the circumstances. For most excel files with embedded html these rules are going to be good enough… for most but not as a rule set in stone for everything. You need to make sure you create rules to suit the content of the file you have, especially if you have an XML as opposed to Excel which is quite likely to require correct handling. Send me the xml file if you can and I’ll take a quick look – pfilkin@sdl.com.

  8. Sarah said:

    Hi Paul, I have been trying to add a rule (placeholder) for a line break which can be either represented as or in HTML. This is in my Excel file type. Somehow, Trados doesn’t like this and sends me an error message at the time of preparing the file saying that the file is missing tag pairs. Do you know where that would be coming from and if there is a work-around for that type of tags?

    • Hi Sarah, I can’t see the tags you tried to demonstrate… maybe you can drop me an email and I’ll take a look?

      • See your message above: “There will be a solution to this shortly using an OpenExchange application that creates a nice clean protected xliff for translation.:”

  9. Lieven said:

    Hi Paul, I’ve been trying to find a way to translate HTML text copied from website back-offices (CMS) and pasted into Excel files, because of the embedded content feature. Although this works for < > and {} tags, I’m stuck on HTML specials such as “&”, which should ideally be converted as in the Convert Entities feature for XML files. Do you know of a simple way to include this in the Excel embedded content feature? Or any other ideal way to translate such HTML content from website content management systems?

    • Hi, for now you have to use a regex for this too. I’m hopeful we’ll be able to chain the html filter to the excel filter in a future release and this will make it easier. But for now you need to add a rule… maybe something like this:
      &.*?;
      You could even create a specific set of different rules that were less general so you could go to the advanced settings and add the actual text you would see inside the tag which might make it a little better for the translator.

      • Lieven said:

        Thanks Paul. That’s what I thought, or was afraid of as it’s a bit annoying to see accented characters as tags inside words. I’ll try the advanced settings. Thanks. Lieven

      • Lieven said:

        Hi again, When adding e.g. “&” as a placeholder and ticking “Tag represents the text: &” in Advanced, should this not convert the tag into “&” instead of showing the tag? Or what this option do exactly? Or am I forgetting something? Many thanks, Lieven

      • Hi Lieven, well I may have given you a bum steer. I’m pretty sure this is how it should work, but I see the problem you may have come up against. I just tested this and don’t see the intended display… I just get the very entity I wanted to hide. I have reported this, but didn’t want you wasting more time on it until I hear back next week. Might be a bug.

  10. Chun-yi Chen said:

    Hi Paul, This is an interesting and informative article, thank you once again for sharing your expertise!
    I handle these type of files occasionally, which contain a lot of formatting tags like as you described. However, the client (or the agency I should say) always converts these Excel files to itd files. Does this mean there will be no way for me to tweak the settings and hide these tags in the itd file? I looked at the SDL Edit under File Types and did not see the “Embedded content” option.

    • Hi Chun-yi, correct. Unfortunately if the file has been converted to ITD already then you can’t do this because there is currently no embedded content processor for this format.

      • Chun-yi Chen said:

        Thank you for confirming this. Too bad that my client still uses SDLX. I can’t work with a CAT tool that does not have the filter function. So I will not bug you guys at SDL to add this embedded content functionality to the itd file type–this is legacy and we need to move on:)

  11. Dave Simons said:

    This is a great resource. I might be a bit late here, but I have an Excel sheet which contains XML, and while I can handle most of the content, how can I get Trados to protect the XML declaration? (). Your help is appreciated!

    • I guess you could convert the whole thing to a tag? So if the declaration was this:

      <?xml version=”1.0″ encoding=”UTF-8″?>

      You could add a regex for the entire thing something like this:

      <\?.*?\?>

      I think that would probably work.

      • Dave Simons said:

        That works perfectly! Thank you so much for your quick response.

  12. Hey Paul
    Awesome post. Thanks a lot. I’ve extended your post a bit:

    http://remy.supertext.ch/2014/08/translate-excel-files-with-embedded-html-content-in-sdl-trados-studio/

    But I’m are struggling a bit with the alt tag. You example only works if there are no other attributes inside the img tag. As soon as there are src, style, class or anything else in a different sequence it falls apart, Unfortunately, I’ve tried , but either it does not segment it correctly or I get an Unmatched tag pair error. Since you seem to be the Regex-King maybe you have a better idea?

    • Hey Remy… I’d hardly say regex king… I’m more like a learner trying to share the things I’ve learned along the way!

      I think handling embedded code in excel is great for simple cases. More complex scenarios really should not be in Excel in the first place, but we do see them. So I think if it was practical I would remove the heavily scripted segments and handle with the html filter, then put them back. If it’s not practical then regex is the way to go, but how you handle it will depend on the content of the file and the sort of things you want to extract. The expressions are probably going to become quite complex and fraught with the likelihood of errors from unmatched tag pairs as you have found out. This is just because the expressions have to become more specific to avoid this happening.

      The best approach is coming (not sure when we will see this yet) and that will be to have the ability to use the html filter inside the excel filetype for embedded content, exactly as we have done for the xml filetype.

  13. Handling HTML in Excel the same way as in XML would be awesome.
    I agree with you that Excel is not the best format, but not that I can control this. Sharepoint for example does this.

    Anyway, was just hoping that you might have a solution for the img alt tags, but no worries. We can work around that.
    In anycase, I think my blog post might help a few people.

  14. Hey Paul, different issue today. It seems that all tags that I define as Placeholder are not separating segments. Even if I choose “Exclude” in the Segmentation Hints under Advanced Properties.
    Regex: |\s+[^>]*/>) for a

    Anything else I have to set?

      • This is correct Frank, but perhaps you can do this via segmentation rules in the TM instead?

      • Frank & Paul, thanks for the feedback. Not sure if I understand your point. This works fine for Tag Pairs it seems. Just for placeholders it does not.

      • In this case… as I wasn’t aware of this difference which sounds like a bug… why not create a tag pair instead? So for example, if the placeholder was <br> then use opening and closing tags like this:
        <b(?=r>)
        (?<=<b)r>
        That seems to do the trick for me.

      • Great idea. The tag recognition works, but I get “the document cannot be processed since it contains unexpected contents.” as soon as I change the tag to “Exclude” in the Segmentation hint dropdown. Which was kinda the idea of this whole thing :-)

      • Perhaps you’ve got some overlap with your expressions? Can you email me your file and then I don’t have to guess? The regex solution is quite handy, but having the ability to use a proper html filter inside Excel will be better!

      • Frank said:

        Hi Paul, I am missing the “Like” button! This is so cool! Thanks

    • Hi Arjen, maybe review this article and post again… it’s a wordpress limitation and you need to use entities for your example in the post :

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 1,804 other followers

%d bloggers like this: