Say goodbye to doppelgangers with ChatGPT…

With all the excitement and interest around ChatGPT these days, and with the numerous interesting projects we’re working on at RWS that involve the use of AI, it’s hard not to let the lure of the technology pull you in.  Last month I wrote an article on corrupt translation memories, and in doing that I dabbled a little with SQLite queries… an SDLTM is a SQLite database.  These are pretty much new to me, as in the past I’ve only used very simple statements… one-liner lookups, for example.  So I had to read some basics and learn a little when I did that.  Nothing new there; I like to read this sort of material and learn a little something anyway.  So when I was asked this evening how to remove translation units from a translation memory that have the same source and the same target, I immediately started to think SQLite.

Why SQLite and not the product?  Well, surprisingly enough there is no function in Trados Studio to do this.  The maintenance features in the Translation Memories view are fairly basic and, even more surprisingly, we have never built a small tool with the AppStore team to enhance them.  So back to the SQLite documentation to figure out how to do this.

Safety First

The first step in carrying out any such maintenance directly on an SDLTM is to take a copy of your TM and NEVER EVER do this without safely backing up first!!  Once you’re working on a copy you can do whatever you like!

Working with SQLite

To work with SQLite, the first thing to do is make sure you have a sensible SQLite editor.  I use the excellent DB Browser for SQLite, which seems to give me everything I need at the moment.  All I need now is a translation memory, so I create a TMX from the DGT collection, English to French, containing 76794 translation units, and upgrade it in Trados Studio.  The upgrade cleans out invalid, blank, number-only etc. units, and this leaves me with a translation memory holding 63044 translation units.  Perfect for testing with.

I then set about trying to learn how to find translation units that have the same source/target content.  After some reading I decided I really needed to use a regular expression, since the content I’m looking for is actually wrapped in an XML wrapper within the SQLite table rows.  Something like this:
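A simplified sketch of what sits in the source_segment and target_segment columns… the real XML carries more elements and attributes than this, so treat the exact structure as illustrative:

<Segment>
  <Elements>
    <Text>
      <Value>The text of the segment sits in here.</Value>
    </Text>
  </Elements>
</Segment>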

To get at the text itself I came up with this:

(?<=^.+?<Value>).+?(?=</Value>)
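Broken down, the pattern reads like this (just my own annotation of the same expression):

(?<=^.+?<Value>)   lookbehind… the match must be preceded by the start of the row and the first <Value> tag
.+?                the segment text itself, matched lazily so it stops as early as possible
(?=</Value>)       lookahead… the match must be followed by the closing </Value> tag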

That does a pretty good job of extracting only the text I’m interested in.

My problem is figuring out how to use this in an SQL query.  After 15 minutes or so (I am short on patience…) I came up with this:

SELECT source_segment, target_segment
FROM translation_units
WHERE source_segment REGEXP (?<=^.+?<Value>).+?(?=</Value>) = target_segment REGEXP (?<=^.+?<Value>).+?(?=</Value>);

Which duly returned an error in the DB Browser application.

A little more research and I learn that regex support in SQLite is limited, and I have definitely exceeded it here… and to be honest I’m not even confident of my syntax for the rest of the query.
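It turns out SQLite recognises the REGEXP operator but, out of the box, doesn’t supply the regexp() function behind it… the application has to provide one.  And even where one is available, REGEXP is just a true/false filter against a quoted pattern string, so it can’t extract the matched text the way my lookbehind/lookahead expression was trying to.  A minimal sketch of what a legal REGEXP query would look like, assuming a regexp() implementation is loaded:

-- REGEXP only filters rows; the pattern must be a quoted string,
-- and nothing is extracted from the match
SELECT source_segment
FROM translation_units
WHERE source_segment REGEXP '<Value>.+</Value>';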

ChatGPT to the rescue!

So I decided to see if I could solve this by asking my new OpenAI friend, ChatGPT.  I ask it what’s wrong with my query.

Well… wow!  Every time I look at this I’m amazed at how good the answers are.  At least they always look good and this can be deceiving.  Certainly a lot of the tests we have done at work prove how often it can be eloquently wrong as well as perfectly brilliant!  So let’s take a look at this suggested code which I can even conveniently copy with a single click:

SELECT source_segment, target_segment
FROM translation_units
WHERE substr(source_segment, instr(source_segment, '<Value>') + 7, instr(source_segment, '</Value>') - instr(source_segment, '<Value>') - 7) =
substr(target_segment, instr(target_segment, '<Value>') + 7, instr(target_segment, '</Value>') - instr(target_segment, '<Value>') - 7);

I paste it into the DB Browser and execute it.

It runs perfectly and finds 46 rows in my 63 thousand TU translation memory that are same source same target, and it finds them in 592ms.  I checked them and they’re correct!
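If, like me, you need a second to digest the substr/instr arithmetic: '<Value>' is 7 characters long, so instr(…) + 7 lands on the first character of the segment text, and the length argument measures everything up to the closing '</Value>'.  A quick way to convince yourself is to run the same logic against a made-up literal rather than the real table:

-- sanity-check the extraction logic on a hypothetical value
SELECT substr(s, instr(s, '<Value>') + 7,
              instr(s, '</Value>') - instr(s, '<Value>') - 7) AS extracted
FROM (SELECT '<Segment><Value>Bonjour</Value></Segment>' AS s);
-- returns: Bonjour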

So I naturally ask it how I would adapt the previous query to delete each row in the database table.

Once again it obliges instantly with the requested statement:

DELETE FROM translation_units
WHERE substr(source_segment, instr(source_segment, '<Value>') + 7, instr(source_segment, '</Value>') - instr(source_segment, '<Value>') - 7) =
substr(target_segment, instr(target_segment, '<Value>') + 7, instr(target_segment, '</Value>') - instr(target_segment, '<Value>') - 7);

Looks simple enough, so I execute that one too!

In 492ms it deletes all 46 rows from my TM.  I then open it up in the Translation Memories view of Trados Studio and check that it still works.  Everything looks good… although when I check the number of TUs it’s still showing 63044.

So then I ask my AI friend if there is anything I need to do after removing rows in a SQLite database table to ensure it’s properly prepared for use.

The answer is again pretty good, and I execute the VACUUM command, which completes in 22853ms.
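For the record, that’s the whole command… and while I’m in there a quick integrity check doesn’t hurt either (the PRAGMA is my own addition and wasn’t part of ChatGPT’s answer):

VACUUM;                  -- rebuild the database file and reclaim the space freed by the deletes
PRAGMA integrity_check;  -- optional sanity check… should return 'ok'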

Certainly I am learning some stuff here, and it’s very specific to my actual task every time.  I now open the TM in Trados Studio again and it’s still showing 63044 TUs.  So I do the obvious: recompute the fuzzy index statistics and re-index the translation memory.  This returns exactly what I hoped to see.

And now, after closing the settings window and opening it again, the number of translation units correctly shows 62998… exactly the 63044 I started with minus the 46 deleted rows.  I’m good to go!
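One more habit worth mentioning before I recap.  If running a straight DELETE makes you nervous even on a backup copy, you can wrap it in a transaction (this is me talking now, not ChatGPT) and check the row count before you commit anything:

BEGIN TRANSACTION;
-- the same DELETE statement as above
DELETE FROM translation_units
WHERE substr(source_segment, instr(source_segment, '<Value>') + 7, instr(source_segment, '</Value>') - instr(source_segment, '<Value>') - 7) =
substr(target_segment, instr(target_segment, '<Value>') + 7, instr(target_segment, '</Value>') - instr(target_segment, '<Value>') - 7);
SELECT changes();  -- how many rows did that just delete?
COMMIT;            -- or ROLLBACK; if the number looks wrong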

The steps…

I think at this stage a quick recap would be useful of just what the steps are for removing same source same target (SSST) translation units (TUs):

  1. take a backup of your SDLTM
  2. open the SDLTM in the DB Browser
  3. optionally check for these SSST TUs by executing this query:
      SELECT source_segment, target_segment
      FROM translation_units
      WHERE substr(source_segment, instr(source_segment, '<Value>') + 7, instr(source_segment, '</Value>') - instr(source_segment, '<Value>') - 7) =
      substr(target_segment, instr(target_segment, '<Value>') + 7, instr(target_segment, '</Value>') - instr(target_segment, '<Value>') - 7);
  4. if there are any, delete them with this statement:
      DELETE FROM translation_units
      WHERE substr(source_segment, instr(source_segment, '<Value>') + 7, instr(source_segment, '</Value>') - instr(source_segment, '<Value>') - 7) =
      substr(target_segment, instr(target_segment, '<Value>') + 7, instr(target_segment, '</Value>') - instr(target_segment, '<Value>') - 7);
  5. execute the VACUUM command to optimize the TM database, reclaim any unused space and improve the performance of the TM
  6. open the TM in the Translation Memories view of Trados Studio and “Recompute Fuzzy Index Statistics…”, then “Re-index Translation Memory”
  7. check that the drop in the TU count shown in Trados Studio matches the number of rows deleted in the DB Browser
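And if you’re comfortable in the Execute SQL tab of the DB Browser, steps 3 to 5 can be run as one small script… shown here as a sketch with a count first, so you can see how many SSST TUs you’re about to lose:

-- how many SSST TUs are there?
SELECT count(*)
FROM translation_units
WHERE substr(source_segment, instr(source_segment, '<Value>') + 7, instr(source_segment, '</Value>') - instr(source_segment, '<Value>') - 7) =
substr(target_segment, instr(target_segment, '<Value>') + 7, instr(target_segment, '</Value>') - instr(target_segment, '<Value>') - 7);

-- delete them and reclaim the space
DELETE FROM translation_units
WHERE substr(source_segment, instr(source_segment, '<Value>') + 7, instr(source_segment, '</Value>') - instr(source_segment, '<Value>') - 7) =
substr(target_segment, instr(target_segment, '<Value>') + 7, instr(target_segment, '</Value>') - instr(target_segment, '<Value>') - 7);
VACUUM;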

Now that you know this, the final thing I’d say is that you do it at your own risk!!  So backup… backup… backup before you try it.  Certainly I’m personally encouraged by this exercise to see what else I can learn to do in terms of managing the content in a TM with SQLite… and I’ll also see what we can do in terms of making a robust solution via the appstore!

The takeaway…

Well, there is no denying how useful AI, and this OpenAI tool, ChatGPT, is going to be.  I’ve been playing with it for a while and asking it questions around everything I get involved with just to see how it responds.

Tonight I have literally gone from trying to tackle this real question I was asked about deleting translation units in a translation memory, to writing up this article, in around 3 hours… despite being pretty ignorant of how to run SQLite queries like this.  Admittedly I did know how to write the regular expression needed to feed into ChatGPT, and I did know how to resolve the problem of recomputing the fuzzy indexes etc.  But even though I haven’t checked, I’m pretty sure I could have got there without even this rudimentary knowledge, just by asking the questions in plain language.

But it is dangerous and should be used with care.  In the hands of someone who can apply it sensibly and ethically I think it’s a fantastic tool to have at your disposal.  But there’s also a very negative side to it, ranging from the effect it could have on our children’s ability to think for themselves, to massive amounts of misleading and incorrect content appearing in what’s already an overwhelming barrage of information and stimuli, often negative, from multiple social media platforms… and more!

To finish off I asked ChatGPT about this very question… and its answer was:

AI solutions like ChatGPT can be used ethically and without harming people if they are developed, trained, and deployed with careful consideration for ethical and safety concerns. Some best practices for ethical AI include: ensuring fairness and avoiding discrimination, protecting privacy, being transparent about how AI systems make decisions, and avoiding potential harm to individuals or society. However, the ultimate responsibility for ethical AI usage lies with the people and organizations using these technologies, and they should continuously monitor and assess the impacts of AI systems to ensure they align with ethical principles and values.

So it’s down to us… I’m going to try and remain optimistic!

TM corruption…

This is a topic that probably occurred a lot more in the old days of Trados and Translators Workbench, where it was relatively easy to corrupt a translation memory.  In those days the translation memory consisted of five file types with the extensions .tmw, .mwf, .mtf, .mdf and .iix, and when problems did occur it was probably related to the files that supported indexing and lookup speeds, for example.  The .tmw file itself, which contained the translation units, was less likely to be the source of the problem.  So fixing it could often be achieved by creating a new translation memory with the same settings, and then simply replacing the .tmw in the new translation memory with the old one… and finally reorganising.  This didn’t always help, but it often did!


The Black Box…

An interesting thing about Trados Studio is how easy it makes it for new users to save all their projects in one place, all their translation memories and all their termbases.  In fact it’s so easy that I’m absolutely certain many users never give these things a second thought as they start to get to grips with using the software itself.  But then, one day, these users upgrade… in some cases even the upgrade seems to happen automagically… and then the problems start.  After becoming a competent Trados Studio user, excited to try the latest version, all of a sudden the translation memories, termbases and projects are all gone!  Lost inside the black box otherwise known as Trados Studio!


Short term memories…

“Not only is my short-term memory horrible, but so is my short-term memory.”  I have no idea who this quote can be attributed to, and it’s certainly not original, but it is quite appropriate when I start to think about the evolution of Trados.  Ever since Trados Studio was launched you can be sure to find many “experts” in places like ProZ and even the SDL Community recommending you don’t upgrade because there is no difference compared to the last version.  To be fair, if you only use a fraction of the features despite having used the software for a decade, then it probably is like this.  The alternative being that these “experts” have very short-term memories.


Some you win… some you lose

When we released the new Trados 2021 last week I fully intended my first article, after the summary of the release notes, to be something based around the new appstore integration.  The number of issues we are seeing with this release is very low, which is a good thing, but nonetheless I feel compelled to tackle one thing first that has come up a little in the forums.  It relates to some changes made to improve the product for the many.


The “Old Farts Language Code Club”…

The Old Farts Language Code Club! This is a new club, inspired by a comment in the SDL Community from a prospective member. I’m not sure yet of the age at which you can qualify for membership, but in addition to the age requirements, which may have stringent rules to prevent any young whippersnappers from joining, it’s essential that prospective members have a good grasp of the language codes used in Trados Studio. I’m also not sure of the demand, so I may open a waiting list that could include anyone who already makes good use of the language codes in Trados Studio but isn’t an old fart yet!


Apply a TM Template

Ever since the release of Studio 2009 we have had the concept of Language Resource Templates, and ever since the release of Studio 2009 I’d risk a bet that most users don’t know what they’re for or how to use them.  To be fair this is hardly a surprise since their use is actually quite limited out of the box and access to the goodies inside is pretty hard to get at.  It’s been something I used to see users complain about a long time ago but for some years now I rarely see them mentioned anymore.  This article, I hope, might change that.


Slicing fruit!

If there’s one thing I firmly believe, it’s that all translators should learn a little bit of regex, or regular expressions.  In fact it probably wouldn’t hurt anyone to know how to use them a little, simply because they are so useful for manipulating text, especially when it comes to working in and out of spreadsheets.  When I started to think about this article today I was thinking about how to slice up text so that it’s better segmented for translation; and I was thinking about what data to use.  I settled on lists of data, as this sort of question comes up quite often in the community, and to create some sample files I used this wikipedia page.  It’s a good list, so I copied it as plain text straight into Excel, which got me a column of fruit formatted exactly as I would like to see it if I was translating it… one fruit per segment.  But as I wanted to replicate the sort of lists we see translators getting from their customers, I copied the list into a text editor and used regex to replace the hard returns (\n) with a comma and a space, then broke the file up alphabetically… took me around a minute to do.  I’m pretty sure that kind of simple manipulation would be useful for many people in all walks of life.  But I digress…
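For anyone curious, that find & replace looks something like this in any regex-capable text editor… just a sketch, as the exact escape syntax varies from editor to editor:

Find what:     \r?\n
Replace with:  ", "   (a comma followed by a space)

…and the one-fruit-per-line list becomes a single comma-separated run of text.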


Too many language variants!

Is English (Europe) the new language on the other side of the Channel that we’ll all have to learn if Brexit actually happens… and will Microsoft ever create a spellchecker for it now that they’ve added it to Windows 10?  Why are there 94 different variants of English in Studio, coming from the Microsoft operating system, and only two Microsoft Word English spellcheckers?  Why don’t we have English (Scouse), English (Geordie) or English (Brummie)… probably more distinct than the differences between English (United States) and English (United Kingdom), which are the two variants Microsoft can spellcheck.  These questions, and similar ones for other language variants, are all questions I can’t answer and this article isn’t going to address!  But I am going to address a few of the problems that having so many variants can create for users of SDL Trados Studio.
