Say goodbye to doppelgangers with ChatGPT…

With all the excitement and interest around ChatGPT these days, and with the numerous interesting projects we’re working on at RWS that involve the use of AI, it’s hard not to let the lure of the technology pull you in.  Last month I wrote an article on corrupt translation memories, and in doing this I dabbled a little with SQLite queries… an SDLTM is a SQLite database.  SQLite queries are pretty much new to me as I’ve only used very simple statements in the past… one-liner lookups, for example.  So I had to read some basics and learn a little when I did that.  Nothing new there; I like to read this sort of material and learn a little something new anyway.  So when I was asked this evening how to remove translation units from a translation memory that have the same source and the same target, I immediately started to think SQLite.

Why SQLite and not the product?  Well, surprisingly enough there is no function in Trados Studio to do this.  The maintenance features in the Translation Memory view are fairly basic, and even more surprisingly we have never built a small tool to enhance this with the AppStore Team.  So back to the SQLite documentation to figure out how to do this.

Safety First

The first step in carrying out any such maintenance directly on an SDLTM is to take a copy of your TM and NEVER EVER do this without safely backing up first!!  Once you’re working on a copy you can do whatever you like!

Working with SQLite

To work with SQLite, the first thing to do is make sure you have a sensible SQLite editor to make life easier.  I use the excellent DB Browser for SQLite, which seems to give me everything I need at the moment.  All I need now is a translation memory, so I create a TMX from the DGT collection, English to French, containing 76,794 translation units, and upgrade it into Trados Studio.  The upgrade cleans out invalid, blank and number-only units, etc., leaving me a translation memory holding 63,044 translation units.  Perfect for testing with.

I then set about trying to learn how to find translation units that have the same source and target content.  After some reading I decided I really needed a regular expression to find what I was after, since the text I’m looking for is wrapped in XML within the SQLite table rows, with the actual segment text sitting inside a <Value> element.  Something like this:

To get at this I figured I could use a regular expression, and came up with this:

(?<=^.+?<Value>).+?(?=</Value>)

That does a pretty good job of extracting only the text I’m interested in.

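Outside of SQLite, the extraction itself is easy to sanity-check with a few lines of Python (not something I needed for this job, just a convenient way to test the idea).  Note that Python’s built-in re module doesn’t accept the variable-width lookbehind used above, so this sketch uses a capture group to the same effect, and the sample XML is only an illustrative stand-in for the real SDLTM wrapper:

```python
import re

# Illustrative stand-in for the XML wrapper around a segment in an SDLTM row;
# the real schema differs, but the point is the text sits inside a <Value> element.
source_segment = "<Segment><Elements><Text><Value>Hello world</Value></Text></Elements></Segment>"

# Python's re module rejects variable-width lookbehinds such as (?<=^.+?<Value>),
# so a capture group does the same job here: grab the text between the tags.
match = re.search(r"<Value>(.*?)</Value>", source_segment)
print(match.group(1))  # -> Hello world
```
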
My problem is figuring out how to use this in a SQL Query.  After 15 minutes or so (I am short on patience…) I came up with this:

SELECT source_segment, target_segment
FROM translation_units
WHERE source_segment REGEXP (?<=^.+?<Value>).+?(?=</Value>) = target_segment REGEXP (?<=^.+?<Value>).+?(?=</Value>);

Which duly returned this error in the DB Browser application:

A little more research and I learn that regex support in SQLite is very limited, and I have definitely exceeded it here… and to be honest I’m not even confident of my syntax for the rest of the query.
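As it turns out, SQLite parses the REGEXP operator but ships no regexp() implementation at all, which is why DB Browser complains.  If you drive the database from a host language rather than DB Browser, you can register one yourself; here’s a minimal Python sketch using the standard sqlite3 module (not what I did here, just to show how the gap can be filled):

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite understands the REGEXP operator but provides no regexp() function by
# default; registering one from the host language fills that gap.  SQLite calls
# it as regexp(pattern, value).
conn.create_function("REGEXP", 2, lambda pattern, value: re.search(pattern, value) is not None)

conn.execute("CREATE TABLE demo (txt TEXT)")
conn.executemany("INSERT INTO demo VALUES (?)", [("<Value>same</Value>",), ("plain text",)])

rows = conn.execute("SELECT txt FROM demo WHERE txt REGEXP '<Value>.*</Value>'").fetchall()
print(rows)  # -> [('<Value>same</Value>',)]
```
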

ChatGPT to the rescue!

So I decided to see if I could solve this by asking my new OpenAI friend, ChatGPT.  I ask it what’s wrong with my query:

Well… wow!  Every time I look at this I’m amazed at how good the answers are.  At least they always look good and this can be deceiving.  Certainly a lot of the tests we have done at work prove how often it can be eloquently wrong as well as perfectly brilliant!  So let’s take a look at this suggested code which I can even conveniently copy with a single click:

SELECT source_segment, target_segment
FROM translation_units
WHERE substr(source_segment, instr(source_segment, '<Value>') + 7, instr(source_segment, '</Value>') - instr(source_segment, '<Value>') - 7) =
substr(target_segment, instr(target_segment, '<Value>') + 7, instr(target_segment, '</Value>') - instr(target_segment, '<Value>') - 7);

I paste it into the DB Browser and execute it:

It runs perfectly, finding the 46 rows in my 63-thousand-TU translation memory that have the same source and target… and it finds them in 592ms.  I checked them and they’re correct!
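To see the logic of ChatGPT’s query without a full SDLTM to hand, it can be reproduced on a toy table.  The table and column names below match the query from the article, but the XML wrapper is a simplified stand-in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE translation_units (source_segment TEXT, target_segment TEXT)")

wrap = "<Segment><Value>{}</Value></Segment>"  # simplified stand-in for the SDLTM XML
conn.executemany(
    "INSERT INTO translation_units VALUES (?, ?)",
    [
        (wrap.format("Hello"), wrap.format("Bonjour")),  # properly translated
        (wrap.format("RWS"), wrap.format("RWS")),        # same source, same target
    ],
)

# ChatGPT's substr/instr approach: slice out the text between <Value> and </Value>
# on both sides (+7 skips the 7 characters of '<Value>') and compare the results.
rows = conn.execute("""
    SELECT source_segment, target_segment
    FROM translation_units
    WHERE substr(source_segment, instr(source_segment, '<Value>') + 7,
                 instr(source_segment, '</Value>') - instr(source_segment, '<Value>') - 7) =
          substr(target_segment, instr(target_segment, '<Value>') + 7,
                 instr(target_segment, '</Value>') - instr(target_segment, '<Value>') - 7)
""").fetchall()
print(len(rows))  # -> 1, only the same-source-same-target row
```
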

So I naturally ask it how I would adapt the previous query to delete each row in the database table:

Once again it obliges instantly with the requested statement:

DELETE FROM translation_units
WHERE substr(source_segment, instr(source_segment, '<Value>') + 7, instr(source_segment, '</Value>') - instr(source_segment, '<Value>') - 7) =
substr(target_segment, instr(target_segment, '<Value>') + 7, instr(target_segment, '</Value>') - instr(target_segment, '<Value>') - 7);

Looks simple enough, so I execute that one too!

In 492ms I deleted all 46 rows from my TM.  I then open it up in the Translation Memories view of Trados Studio to check it still works.  Everything looks good, but when I check the number of TUs it’s still showing 63,044.
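The DELETE works the same way as the SELECT; a small Python sketch (same simplified wrapper as before, not a real SDLTM) shows it removing only the doppelganger row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE translation_units (source_segment TEXT, target_segment TEXT)")
wrap = "<Segment><Value>{}</Value></Segment>"  # simplified stand-in for the SDLTM XML
conn.executemany(
    "INSERT INTO translation_units VALUES (?, ?)",
    [
        (wrap.format("Hello"), wrap.format("Bonjour")),
        (wrap.format("RWS"), wrap.format("RWS")),  # the doppelganger to remove
    ],
)

# Same substr/instr comparison as the SELECT, but as a DELETE
cur = conn.execute("""
    DELETE FROM translation_units
    WHERE substr(source_segment, instr(source_segment, '<Value>') + 7,
                 instr(source_segment, '</Value>') - instr(source_segment, '<Value>') - 7) =
          substr(target_segment, instr(target_segment, '<Value>') + 7,
                 instr(target_segment, '</Value>') - instr(target_segment, '<Value>') - 7)
""")
conn.commit()
print(cur.rowcount)  # -> 1 row deleted
remaining = conn.execute("SELECT count(*) FROM translation_units").fetchone()[0]
print(remaining)  # -> 1 row left
```
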

So then I ask my AI friend if there is anything I need to do after removing rows in a SQLite database table to ensure it’s properly prepared for use:

The answer is again pretty good, and I execute the VACUUM command, which completes in 22,853ms:
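To see what VACUUM actually does, here’s a short sketch: deleting rows only marks the freed pages as reusable within the file, and it’s VACUUM that rebuilds the file and gives the space back:

```python
import os
import sqlite3
import tempfile

# VACUUM rewrites the database file, so this demo needs a real file, not :memory:
path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE t (txt TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("x" * 1000,) for _ in range(2000)])
conn.commit()

conn.execute("DELETE FROM t")  # the rows are gone, but the pages are only marked free...
conn.commit()
before = conn.execute("PRAGMA page_count").fetchone()[0]

conn.execute("VACUUM")         # ...until VACUUM rebuilds the file
after = conn.execute("PRAGMA page_count").fetchone()[0]
print(before, after)           # the page count drops sharply after VACUUM
conn.close()
```
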

Certainly I am learning some stuff here, and it is very specific to my actual task every time.  I now open the TM in Trados Studio again and it’s still showing 63,044 TUs.  So I do the obvious: recompute the fuzzy index statistics and re-index the translation memory.  This returns exactly what I hoped to see:

And now, after closing the settings window and opening it again, the number of translation units correctly shows 62,998.  I’m good to go!

The steps…

I think at this stage a quick recap would be useful of just what the steps for removing same source same target (SSST) translation units (TUs) are:

  1. take a backup of your SDLTM
  2. open the SDLTM in the DB Browser
  3. optionally check for these SSST TUs by executing this query:
    1. SELECT source_segment, target_segment
      FROM translation_units
      WHERE substr(source_segment, instr(source_segment, '<Value>') + 7, instr(source_segment, '</Value>') - instr(source_segment, '<Value>') - 7) =
      substr(target_segment, instr(target_segment, '<Value>') + 7, instr(target_segment, '</Value>') - instr(target_segment, '<Value>') - 7);
  4. if there are any delete them with this statement:
    1. DELETE FROM translation_units
      WHERE substr(source_segment, instr(source_segment, '<Value>') + 7, instr(source_segment, '</Value>') - instr(source_segment, '<Value>') - 7) =
      substr(target_segment, instr(target_segment, '<Value>') + 7, instr(target_segment, '</Value>') - instr(target_segment, '<Value>') - 7);
  5. execute the VACUUM command to optimize the TM database, reclaim any unused space and improve the performance of the TM
  6. open the TM in Trados Studio in the Translation Memory View and “Recompute Fuzzy Index Statistics…”, then “Re-index Translation Memory”
  7. check that the number of TUs deleted in DB Browser matches the drop in the count shown in Trados Studio
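The steps above (minus the Trados Studio re-indexing, which has to be done in the product itself) can be sketched end to end in Python.  Treat this as an illustrative outline rather than a supported tool; the table and column names come straight from the queries above, and `remove_ssst` is just a name I’ve made up for the sketch:

```python
import shutil
import sqlite3

# The substr/instr comparison from the queries above, shared by the SELECT and the DELETE
SSST_WHERE = """
    substr(source_segment, instr(source_segment, '<Value>') + 7,
           instr(source_segment, '</Value>') - instr(source_segment, '<Value>') - 7) =
    substr(target_segment, instr(target_segment, '<Value>') + 7,
           instr(target_segment, '</Value>') - instr(target_segment, '<Value>') - 7)
"""

def remove_ssst(sdltm_path: str) -> int:
    """Back up the TM, delete same-source-same-target TUs, VACUUM, return rows removed."""
    shutil.copy2(sdltm_path, sdltm_path + ".bak")  # step 1: backup first!
    conn = sqlite3.connect(sdltm_path)
    found = conn.execute(                           # step 3: check for SSST TUs
        "SELECT count(*) FROM translation_units WHERE " + SSST_WHERE).fetchone()[0]
    if found:
        conn.execute("DELETE FROM translation_units WHERE " + SSST_WHERE)  # step 4
        conn.commit()
        conn.execute("VACUUM")                      # step 5: reclaim the space
    conn.close()
    return found
```

After running something like this you would still open the TM in Trados Studio and recompute the fuzzy index statistics and re-index, exactly as in step 6.
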

Now that you know this, the final thing I’d say is that you do it at your own risk!!  So backup… backup… backup before you try it.  Certainly I’m personally encouraged by this exercise to see what else I can learn to do in terms of managing the content in a TM with SQLite… and I’ll also see what we can do in terms of building a robust solution via the AppStore!

The takeaway…

Well, there is no denying how useful AI, and this OpenAI tool, ChatGPT, is going to be.  I’ve been playing with it for a while and asking it questions around everything I get involved with just to see how it responds.

Tonight I have literally gone from trying to tackle this real question I was asked about deleting translation units in a translation memory, to writing up this article, in around three hours, despite being pretty ignorant of how to run SQLite queries like this.  Admittedly I did know how to write the regular expression needed to feed into ChatGPT, and I did know how to resolve the problem of recomputing fuzzy indexes etc.  But even though I haven’t checked, I’m pretty sure I could have got there without even this rudimentary knowledge just by asking the questions in plain language.

But it is dangerous and should be used with care.  I think in the hands of someone who can apply it sensibly and ethically it’s a fantastic tool to have at your disposal.  But there’s also a very negative side to it, ranging from the effect it could have on our children’s ability to think for themselves, to massive amounts of misleading and incorrect content appearing in what’s already an overwhelming barrage of information and stimuli, often negative, from multiple social media platforms… and more!

To finish off I asked ChatGPT about this very question… and its answer was:

AI solutions like ChatGPT can be used ethically and without harming people if they are developed, trained, and deployed with careful consideration for ethical and safety concerns. Some best practices for ethical AI include: ensuring fairness and avoiding discrimination, protecting privacy, being transparent about how AI systems make decisions, and avoiding potential harm to individuals or society. However, the ultimate responsibility for ethical AI usage lies with the people and organizations using these technologies, and they should continuously monitor and assess the impacts of AI systems to ensure they align with ethical principles and values.

So it’s down to us… I’m going to try and remain optimistic!

TM corruption…

This is a topic that probably occurred a lot more in the old days of Trados and Translators Workbench, where it was relatively easy to corrupt a translation memory.  In those days the translation memory consisted of five file types with the extensions .tmw, .mwf, .mtf, .mdf and .iix, and when problems did occur it was probably related to the files that supported indexing and lookup speeds, for example.  The .tmw file itself, which contained the translation units, was less likely to be the source of the problem.  So fixing it could often be achieved by creating a new translation memory with the same settings, and then simply replacing the .tmw in the new translation memory with the old one… finally reorganising.  This didn’t always help, but it often did!

Continue reading “TM corruption…”

Can’t see the wood for the trees…

Back in 2015 I wrote an article called “Good bugs… bad bugs!” which was all about the unintended positive side effect of computer software not working as intended.  I’d actually forgotten about this article until this weekend, as I was pondering my own behaviour in responding to a post in the RWS Community.  In fact it was my wife that got me thinking, as I allowed the community thread to frustrate me because I couldn’t understand why some users can’t see reason… my reason!  I had comfortably created two buckets in my mind… either they are just incapable of understanding and I’m talking to a brick wall, or they just won’t understand because they don’t want to listen since it doesn’t suit their own agenda.  It didn’t help that none of my suggestions were even acknowledged, but nonetheless it took my wife to remind me that perhaps I wasn’t listening to them properly!

Continue reading “Can’t see the wood for the trees…”

Character counts…

The most viewed article I have ever written by far was “So how many words do you think it was?” which I wrote in 2012 almost ten years ago.  I revised it once in 2015 and whilst I could revise it again based on the current versions of Trados Studio I don’t really see the point.  The real value of that article was understanding how the content can influence a word-count and why there could be differences between different applications, or versions of the same application, when analysing a text.  But I do think it’s worth revisiting in the context of MT (machine translation) which is often measured in characters as opposed to words… and oh yes, another long article warning!

Continue reading “Character counts…”

The portfolio merry-go-round…

Growing a product range, buying new companies, being bought yourself, adopting new technology, reorganising etc… all of this creates significant change across an organisation that often feels as though you’re on a merry-go-round where things change as you go around until you’re back to where you started and then it all changes again.  I can only imagine that feeling applies to customers and employees alike as each revolution strives to be better than the last, easier to navigate, meaningful in its purpose and full of the promise of success once properly implemented… and yet slightly confusing at the same time!

Continue reading “The portfolio merry-go-round…”

Translating AI…

Why would you have to?  Surely Ai can translate itself?  If not it sounds like a pretty big topic… or I’m just confused.  Acronyms can do this to you and these days we do have good reason to be confused… Multiterm/Machine Translation (MT), National Aeronautics and Space Administration/North America South America (NASA), Role Playing Game/Rocket Propelled Grenade (RPG), Wages For Housework/Working From Home (WFH)… the latter essentially being the same!!  The list is huge and these days I find myself looking something up almost every day.  Ai is another one… Artificial Intelligence is probably what crossed your mind right from the start, particularly since I put it on top of a brain!  I actually found 164 meanings for this acronym but only one of them matches the topic for my article… and that is Adobe Illustrator which should be a far more manageable topic for translation!

Continue reading “Translating AI…”

The Black Box…

An interesting thing about Trados Studio is how easy it makes it for new users to save all their projects in one place, all their translation memories and all their termbases.  In fact it’s so easy that I’m absolutely certain many users never give these things a second thought as they start to get to grips with using the software itself.  But then, one day, these users upgrade… in some cases even the upgrade seems to happen automagically… and then the problems start.  After becoming a competent Trados Studio user, excited to try the latest version, all of a sudden the translation memories, termbases and projects are all gone!  Lost inside the black box otherwise known as Trados Studio!

Continue reading “The Black Box…”

Introducing the multilingual XML… super filetype!

I was compelled to make a return to a previous theme around Marvel Comics because it’s the only way I can do justice to the amazing work the RWS AppStore team carry out on a daily basis.  There are some things you just can’t wait to get up in the morning for, and for me, one of these things is being able to work with this team on a daily basis.  The first meeting of every day for me is with this team and what a fantastic way to start the day it is!  I started this article by mentioning Marvel, but as you’ll see, the hero of this story is probably a Honey Badger!

Continue reading “Introducing the multilingual XML… super filetype!”

Let’s learn about XML…

This year at the Spring Trados Roadshows the emphasis was firmly placed upon education.  Almost all the presentations were based on providing translators, project managers, localization engineers etc. with great material to help them as they work with the Trados toolsets.

I had a few presentations at this event and decided it might be useful to post a few of them here, especially the ones that might help with some of the common filetype questions we see in the communities from time to time.

Continue reading “Let’s learn about XML…”