Data Protection…

There’s always been the occasional question appearing on the forums about data protection, particularly in relation to the use of machine translation, but as of the 25th May 2018 this topic has a more serious implication for anyone dealing with data in Europe.  I’ve no intention of making this post about the GDPR regulations which came into force in May 2016 and now apply; you’ll have plenty of informed resources for this and probably plenty of opinion in less informed places too, but just in case you don’t know where to find reliable information on this here are a few places to get you started:

With the exception of working under specific requirements from your client, Europe has (as far as I’m aware) set out the only legal requirements for dealing with personal data.  They are comprehensive, however, and deciphering what this means for you as a translator, project manager or client in the translation supply chain is going to lead to many discussions around what you do, and don’t have to do, in order to ensure compliance.  I do have faith in an excellent publication from SDL on this subject since I’m aware of the work that has gone into it, so you could do worse than to look at this for a good understanding of what the new regulations mean for you.

You will have noted by now lots of companies asking you to agree to their updated terms and conditions so they can ensure they are complying with the requirements of GDPR… you may well be as fed up with them as I am!  On a less serious note, I am hoping that ignoring all the companies asking me to agree to their new terms and conditions, or risk not getting their junk, will actually result in me not getting their junk!

Back to business, and the things I do want to cover in this article relate to apps available for SDL Trados Studio that can be used to help support you in your endeavours to ensure compliance with any data privacy requirements while working on your translation projects.  The tasks involved for you if you are sending out data for translation, or in support of translation, will be to ensure that you have controlled the distribution of personal data either by working on the files in a controlled environment, or by anonymizing the data before you send it out.  The first part of this will be to identify what that data is.  If it’s all names and addresses then this is going to be tricky without your client identifying this for you so that you know what to look for.  You can use regular expressions to find proper nouns but this is going to lead to a lot of false positives, especially in languages like German where so many words start with a capital letter in the middle of sentences.  There are AI tools available on the market that can have a go at identifying this sort of data but at the moment they are probably the wrong side of a financial barrier to be really useful for most translators, and I think more work is required to improve their effectiveness in a multilingual environment.  So having a list of names and addresses, or having your client anonymize the documents first, is going to be more helpful.  Telephone numbers, car registration details, dates, email addresses, PCI data (Payment Card Industry), IP addresses etc. are all things you can try to find using regular expressions and expect to have more success.
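
To give a feel for the kind of regular expressions I mean, here is a minimal Python sketch.  The pattern set, the labels and the function name find_personal_data are all my own illustrative assumptions, deliberately simplified, and would need tuning for the locales and formats in your own files:

```python
import re

# Illustrative, deliberately simple patterns (assumptions, not a standard set).
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "date": re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b"),
}


def find_personal_data(text):
    """Return (label, matched text) pairs in order of appearance."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    # Sort by position so the results read in document order.
    return [(label, value) for _, label, value in sorted(hits)]
```

Patterns like these are cheap to run over exported text, but they only flag candidates; someone still has to decide what is really personal data.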

So I’m going to focus on some tools that can help with finding and anonymizing data in a more manual process.  I’ll probably update this document in the coming weeks as we’re working on another tool to help with existing Translation Memories, but for now let’s take a look at these four:

projectAnonymizer

This is a new application on the appstore that’s designed to support the conversion of translatable data contained in the bilingual files of your project into tags, with an optional encryption, so that two things are possible:

  1. the data you wish to protect can’t be changed, or read if the encryption option is used
  2. the data you wish to protect is securely added to a translation memory as a placeholder

The application is called projectAnonymizer but technically it’s pseudonymizing the project files, because the process can be undone once the translation is complete to ensure that the target files are returned to the client with all the correct data in place.  The data is only protected for the translation process.  There’s a wiki in the SDL Community that explains how this application works but in a nutshell here’s the suggested workflow:

  • Create the Studio project
  • Identify the data that needs to be protected and create the rules
  • Run a batch task “Protect Data” to add/import the rules used to identify the data and encrypt with a password as required
  • Complete the translation/review workflow
  • Update your Translation Memories
  • Run a batch task “Unprotect Data” to remove all protection, encrypted or otherwise
  • Correct anything that was affected through the non-translation of protected data
  • Optionally protect the files again to update your TMs and then unprotect them for the final step
  • Save your target files
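
The protect/unprotect round trip is really the essence of pseudonymization, and it can be sketched in a few lines of Python.  To be clear, this is not how projectAnonymizer is implemented (it works with tags in the bilingual files and offers optional encryption); the placeholder format [[PD1]] and the function names protect and unprotect are hypothetical, chosen only to illustrate the idea:

```python
import re


def protect(text, pattern):
    """Swap every match of `pattern` for a numbered placeholder.

    Returns the protected text plus the mapping needed to undo it.
    """
    mapping = {}

    def _swap(match):
        token = f"[[PD{len(mapping) + 1}]]"  # hypothetical placeholder format
        mapping[token] = match.group()
        return token

    return re.sub(pattern, _swap, text), mapping


def unprotect(text, mapping):
    """Put the original values back in place of their placeholders."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```

The translator only ever sees the placeholders, and whoever holds the mapping can restore the real values in the translated text afterwards.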

Whilst I really like this application and believe it certainly has a place in protecting data, in practice I believe that once the hype around GDPR dies down a little and working practices have found some normality, some sanity, the protection of data will be less problematic.  As you’ll see in the video I have produced, you can end up with several problems that will have to be corrected by a “trusted” linguist reviewer anyway, and this is going to add time, cost and quality issues onto work that is already under pressure to be cheaper, faster and better quality:

  • personal names carry a gender, so not knowing them could adversely affect the translation
  • dates normally need to be localized, so protecting them leaves a lot of date checks to work on afterwards
  • etc.

These sorts of things will drive the creation of practical processes and I think the trusty NDA is likely to be used more often than it is already.  The main problem to be dealt with will be information stored in the Translation Memory, and in particular historical data created before we worried about all this stuff.  I will be adding an application to this collection in the coming weeks to deal with this problem of personal data in Translation Memories, which affects everyone as a data processor.

To make the whole process of using the projectAnonymizer app easier to follow I have created a short video here which explains how I think it works in practice:

Length: 10 minutes 11 seconds

SDLXLIFF Anonymizer

This is a great tool developed by Tom Imhof that can be used to anonymize SDLXLIFF files, so the files in your Studio Project.  It’s different to the projectAnonymizer because it doesn’t handle the translatable content, rather it handles the information you can’t see that is stamped onto each sentence you translate.  This information is not only held in the SDLXLIFF but it’s also sent through to the Translation Memory when you return the bilingual files to your client.

The sort of data I’m referring to is this:

  • Translation origin
  • Origin system
  • Created by
  • Modified by
  • Created on
  • Modified on

The relevant fields here would be “Created by” and “Modified by”, but the rest are very useful for other things… you can use your imagination for that one!  An important point to note here is that this tool is used after the translation is complete so the workflow would be something like this:

  • Project Manager creates the project and sends it out for translation
  • Translation is completed and the translator anonymizes the files with this plugin before sending the files back, OR
  • Translation is completed and the Project Manager anonymizes the files before updating them into their Translation Memory
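
If you are curious what this amounts to under the hood, here is a rough Python sketch of the same idea.  It is not how the plugin itself works, and it rests on an assumption you should check against one of your own files: that the user names are held in elements of the form <sdl:value key="created_by">…</sdl:value>, with last_modified_by as the key for the “Modified by” field.  The function name and the default keys are hypothetical:

```python
import re


def anonymize_segment_users(sdlxliff_text, replacement="anonymous",
                            keys=("created_by", "last_modified_by")):
    """Overwrite the user name stored against each segment.

    ASSUMPTION: names live in <sdl:value key="..."> elements; the key
    names passed in `keys` should be verified against a real file.
    """
    key_group = "|".join(re.escape(key) for key in keys)
    pattern = re.compile(
        r'(<sdl:value key="(?:' + key_group + r')">)[^<]*(</sdl:value>)'
    )
    # A function is used as the replacement so backslashes are kept literally.
    return pattern.sub(lambda m: m.group(1) + replacement + m.group(2),
                       sdlxliff_text)
```

Work on a copy of the file, of course, and leave the dates and origin fields alone unless you have a reason to change them.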

Now, what this doesn’t deal with is what happens when you work with server based TMs, or even if you are sharing file-based TMs over an internet connection or on a shared drive.  In these cases applying the changes to the user names afterwards will be too late as the names are already in the Translation Memory, so you should change the username in your settings under the Batch Processing menu:

This setting always seems illogically placed to me because it suggests that this is only applying to a batch task.  But it doesn’t, it applies to every segment you confirm and will update the SDLXLIFF as you work with whatever you write in here.  If you do this then you won’t need to use the SDLXLIFF Anonymizer at all… at least not for ensuring you don’t populate a Translation Memory with your name.  If you want to ensure that there is no mention of your name in the file you worked on at all then you have to think about whether you did these things or not:

  • added any comments
  • used track changes
  • reviewed using TQA
  • created the SDLXLIFF yourself from the source files

All of these things stamp the SDLXLIFF with the username applied from Windows; they won’t go into a Translation Memory but they are held in the file.  You could use search & replace in a text editor to anonymize your name, but be careful with changing the path to the original file, which is used in the SDLXLIFF file to find the original when you need to save the target file!  If you’re really concerned about your username being in the SDLXLIFF files at this level then you should seriously think about changing your Windows name instead.  It’s good to think carefully when you set up your new computer that wants to be friendly and personal… use a name that doesn’t give anything away!

This is actually a good example of what I was referring to earlier when I mentioned normality and sanity!  Everything we do today is designed to feel personal and easy to use; it’s part of taking away the unfriendliness of using a computer.  So now, when some legislation comes in about personal data and it’s at the core of everything we do, how far do you have to go?  I was attending a conference recently where the day started with the organisers dealing with a queue of attendees who all came of their own volition and had already provided their details online, yet were being asked to sign a document specifically to ensure compliance with the GDPR… this is really paranoia in my opinion and all common sense has flown out of the window!  At some point companies need to realise that many of these things, including your username in the SDLXLIFF, are probably collected for legitimate business interests and shouldn’t cause any failure to comply with the GDPR.

But we are still at the beginning of all this madness, so back to the SDLXLIFF (partial) Anonymizer… to make this explanation easier to follow I have created a short video here which explains how this works in practice:

Length: 4 minutes 59 seconds

TMX Anonymizer

This nice little application, from Costas Nadalis, automates a relatively simple process that you could carry out yourself in a decent text editor.  But this is a lot more convenient, supporting multiple TMs and leaving no room for user error!  The process is to change three types of attributes that are found in a TMX file:

  • creationid
  • changeid
  • usagecount

The usagecount is less interesting for the purpose of data protection, but the first two are essentially the userid of the original translator and the last translator to use/edit a translation unit.  If the translators were using the SDLXLIFF Anonymizer this would not be needed, but I think most do not, and we often see requests in the forums from translators wanting to know how to anonymize their userid from their files, or from a translation memory.  The process, using this application, would be this:

  • export your SDLTMs to TMX from SDL Trados Studio
  • Run the TMX Anonymizer application
  • Select the TMX files
  • Give an appropriate value for the creationid and changeid
  • Save the anonymized TMX files
  • Upgrade the anonymized TMX files to SDLTM in SDL Trados Studio
  • Delete the old Translation Memories

This doesn’t deal with the translated content in your translation memories at all, or custom field values that may contain data you wish to anonymize.  It only addresses the userid stored in the system fields.
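
For anyone who prefers to see the idea in code rather than in a text editor, here is a minimal Python sketch of the same change using only the standard library.  It is not the TMX Anonymizer itself; the function name anonymize_tmx and the default value “anonymous” are my own assumptions.  The creationid and changeid attribute names are the ones discussed above:

```python
import xml.etree.ElementTree as ET


def anonymize_tmx(tmx_text, user_id="anonymous"):
    """Overwrite creationid/changeid on every element that carries them."""
    root = ET.fromstring(tmx_text)
    for element in root.iter():
        for attribute in ("creationid", "changeid"):
            if attribute in element.attrib:
                element.set(attribute, user_id)
    # Note: the XML declaration and any DOCTYPE are not written back out.
    return ET.tostring(root, encoding="unicode")
```

Everything else, including the usagecount and the segment text, is left exactly as it was, which is the same limitation described above.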

To make this process easier to follow I have created a short video here which explains how this works in practice:

Length: 5 minutes 10 seconds

SDLTmConvert

I’ve written about this wonderful tool, another from Costas Nadalis, in the past.  This tool can do most of the things that the TMX Anonymizer can do… and then a lot more, including filtering with SQL statements!  You can also work on the SDLTM so you don’t have to convert to TMX first as the conversion is handled by the app; the output won’t be an SDLTM, but it can be a TMX.  A disadvantage of this application over the TMX Anonymizer is that you can only handle one TM at a time, so if you have a lot to work through the TMX Anonymizer may be a preferred tool.  The process with this tool would probably be something like this:

  • Run the SDLTmConvert application
  • Give an appropriate value for username in the TM
  • Optionally remove all field values by ticking “Remove TU info”
  • Save the anonymized TMX files
  • Upgrade the anonymized TMX files to SDLTM in SDL Trados Studio
  • Delete the old Translation Memories

There is another use for this application which I’ve saved till last.  The first problem you’re going to have when you start to look at your translation memories is how to identify the personal data in them in the first place.  You might have a million translation units or more so this is a daunting task.  The search tools related to translation memories in most translation tools are not great, and even those that have quite good features won’t allow you to use the results of a search for anything very practical.  SDLTmConvert however has a very neat feature that supports exporting the source and/or the target text to a text file without any markup.  This will take seconds at the most and you end up with a very nice list of translations in a text file.

Once you have this you can use search patterns to find what you need.  Still a very manual task, but for example something like ([A-Z\d][,.\wa-z-]+\u0020*){2,} would find proper nouns made up of at least two words (plus a little extra to catch likely address formats) and this could find many personal names, addresses, product names perhaps as well as some false positives, but the list would be smaller and more relevant/easier to edit.  I use EditPad as a text editor and it has a very nice set of search features which allows me to place the search results into a new file, then remove all duplicates, trailing and leading spaces etc. so I’m actually left with a much easier file to sanity check.  This isn’t foolproof of course but I think this approach does have some merit when looking for lists of actual text to import into the anonymizing tool I know the SDL Community Development team are working on now.  If you tackle the searches logically then I think you’ll have a pretty good set of information to use.

Of course if you have access to some kind of NLP tool for identifying personal data then I’m sure these cleaned source and target files created by SDLTmConvert would be similarly useful for that.
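
If you don’t have EditPad, the same search, trim and de-duplicate routine can be sketched in a few lines of Python.  The regular expression is the one quoted above; the function name candidate_phrases is a hypothetical of mine, and the lines it takes are assumed to come from the plain text file exported by SDLTmConvert:

```python
import re

# The pattern quoted above: two or more "words" that each start with a
# capital letter or a digit.
CANDIDATE = re.compile(r"([A-Z\d][,.\wa-z-]+\u0020*){2,}")


def candidate_phrases(lines):
    """Collect unique, trimmed matches in first-seen order."""
    seen = {}
    for line in lines:
        for match in CANDIDATE.finditer(line):
            phrase = match.group().strip()  # drop leading/trailing spaces
            seen.setdefault(phrase, None)   # dict keys keep insertion order
    return list(seen)
```

In practice you would pass it an open file, one segment per line, and then sanity check the resulting list by eye just as described above.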

To make this easier to follow I have created a short video here which explains how you can work with SDLTmConvert in practice, both for the TMX conversion and the extraction of text for searching:

Length: 9 minutes 39 seconds
