I love this cartoon with the husband and wife fishing on a calm weekend off.
“Honey, I got a big one on!”
She’s hooked a whopper and he casually responds in the way he always does when she occasionally catches a fish on Sunday morning.
“Yes dear, uh huh…”
The equipment they’ve got, from the boat to the fishing rods, is all perfectly suitable for their usual weekend activities but hopelessly inadequate for handling something like this! Little do they know that the whopper under the surface is going to give them a little more trouble when they try to bring him on board!
The analogy for me was brought home this week when I discovered that IATE (InterActive Terminology for Europe) had made a download available for the EU’s inter-institutional terminology database. You can go and get a copy right now from here – Download IATE. The file is only 118 MB zipped so you can import it straight into your favourite CAT and use it as a glossary to help your productivity!
Great idea… but unzipped it’s a 2.2 GB TBX file containing around 8 million terms in 24 official EU languages. If you even have a text editor capable of opening a file like this (common favourites like Notepad++ can’t even open it) then you’ll see that this equates to over 60 million lines and goodness knows how many XML nodes. That’s quite a whopper and you’re never going to be able to handle this in MultiTerm or any other desktop translation environment without tackling it in bite sized chunks.
But before reaching for your keyboard to find a better text editor you should also give some thought to what you want from this TBX. Do you really want all 24 EU languages in the file, or do you just want two, three or four to help you with your work? If you only want to work with some of the languages then you have another problem, because the specialist translation or terminology tools you have are probably unable to handle this file either… at least not without chopping it into something smaller first. So you have a couple of operations to handle here and they are going to involve handling files of a substantial size.
I decided to have a go at tackling this TBX after seeing a few comments from users who tried it and failed. Was it really impossible? To find out I cut a section out of the TBX to test with, and I had a couple of goes at this to get the size right for my tools, and for my patience. Patience, because when you ask your tools to handle files like this it can involve a lot of waiting, and in some cases a lot of waiting that still never succeeds at all! The first additional tool I used was EditPad Pro, which is a text editor I already own. It was very pleasing when I could open the whole TBX with this and browse the entire contents. So thumbs up for the same developers who created RegexBuddy, another tool I mention from time to time. Cutting out the bit I wanted was easy. The TBX file is structured like this.
The header: this part always needs to be at the top of every file you recreate.
The overall structure: I don’t know what you’d call this part but below the header the entire contents sits within two elements, <text><body>, and the file is finally closed off with the </martif> element at the end:
The terminology sits inside an element called the termEntry element. Each one of these elements has a unique ID, some other fields and the term itself in at least one language, but sometimes in all 24 languages. So for example, the structure for a single term might look like this. Two languages in this one, Czech and Polish, with a unique ID of “IATE-127”. It has a subject field at entry level and it has two data fields at term level for “termtype” and “reliabilityCode” which are some of the fields IATE use to help structure their terminology. This is explained here:
As long as you keep complete termEntry elements and the information within them then you can remove as many as you like to create the desired file size for import. The downloaded TBX itself contains 60,587,834 lines. After a few tests I found that a comfortable size to work with was around 1,000,000 lines. So for my first test I manually removed everything in the TBX below line number one million… ish, taking care to keep the closing elements at the end. I saved this file with a new name and then tackled how to get it into MultiTerm!
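If you would rather not do the cutting by hand, the same idea can be scripted. What follows is only a minimal sketch in Python, not the method I used: the function name `keep_first_entries` is made up for illustration, and it assumes the TBX is pretty-printed so that every `</termEntry>` closing tag sits at the end of its own line (the 60 million line count suggests this, but I have not verified it for every entry). It copies lines until the requested number of term entries is complete and then writes the closing elements itself.

```python
def keep_first_entries(src, dst, max_entries):
    """Copy lines from src to dst until max_entries termEntry elements are complete.

    src and dst are text file objects. Returns the number of entries copied.
    Assumption: each </termEntry> closing tag ends its own line.
    """
    done = 0
    for line in src:
        dst.write(line)
        done += line.count("</termEntry>")
        if done >= max_entries:
            break
    else:
        # The source ended before the limit was reached, so its own
        # closing elements have already been copied.
        return done
    # We stopped early, so close the elements opened at the top of the file.
    dst.write("</body>\n</text>\n</martif>\n")
    return done
```

In practice you would open the big file and the new sample file with `open(..., encoding="utf-8")` and pass both to the function. Because it reads one line at a time, the 2.2 GB file never has to fit in memory.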
Structuring your Termbase
This link on the IATE website gives you enough information to manually create a termbase definition for MultiTerm to match the way the TBX is structured. However, I’m naturally suspicious of these things so I decided that one million lines would be a decent sample I could use to extract the definition file and then I could be sure I had the same structure as the one that was actually used in the TBX. To do this I didn’t use MultiTerm Convert. Instead I used the new version of the Glossary Converter as this has the ability to convert pretty much anything to anything and it can handle a TBX file pretty easily. All I do is drag the TBX onto the Glossary Converter and I am presented with this:
So my one million line sample file has terms for all the 24 languages in it plus two more. One was “la” or “Latin” which is not supported in MultiTerm anyway, and the other is “mul” or Multilingual. I hadn’t come across this abbreviation being used before and as it’s non specific it’s not interesting for me either. So I ignore both of these like this (just click the Ignore on/off button shown in the image above):
I also get four data fields:
- subject field (set at entry level in MultiTerm)
- termType (set as a term field)
- reliabilityCode (set as a term field)
- administrativeStatus (set as a term field)
If you had taken a look at the structure that was proposed in the links to the IATE website you would see that these are actually a little different. So if I had created my definition based on the website rather than allow the Glossary Converter to show me what was actually used I’d have possibly got it wrong!
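If you want to double-check what is really in a sample file without any terminology tool, a short script can produce a similar inventory. This is a hedged sketch only: it assumes the languages are carried in an `xml:lang` attribute on `langSet` elements and that the data fields appear as `descrip` or `termNote` elements with a `type` attribute, which is the usual TBX layout but is an assumption about this particular file. The function name `inventory` is made up for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def inventory(source):
    """Count languages and field types in a TBX sample.

    source is a file name or a binary file object.
    Returns two Counters: (languages, field types).
    """
    langs, fields = Counter(), Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "langSet":
            langs[elem.get(XML_LANG)] += 1
        elif elem.tag in ("descrip", "termNote"):
            fields[elem.get("type")] += 1
        elif elem.tag == "termEntry":
            # The entry has been counted, so free its children.
            elem.clear()
    return langs, fields
```

Printing the two counters for a sample file gives a list of language codes and field names to compare against the termbase definition. Like the Glossary Converter, this has to read the whole file, so it is best pointed at a sample rather than the full download.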
Now the Glossary Converter had to parse the entire file to see what was in it and then report it in a meaningful way, so you can begin to see why it’s a problem to handle files much larger than my sample. Having said this, I wrote this article as I played with the file, and just as I got to the stage of seeing what was in the termbase I was presented with an error. The error says “File sharing lock count exceeded. Increase MaxLocksPerFile registry entry”. What the heck does that mean?? Fortunately the Glossary Converter help file under known issues gives me the answer. It seems to be a bug in either MultiTerm or Microsoft Access (no clear pointers to which one) and there is a link to a Microsoft KB Article, KB#815281. So I guess I can try to apply the fix and see if it works, or reduce the size of my TBX and try again. I figured it would be faster to reduce the size and make sure I don’t break anything else, so I cut it down by 40% to around 600,000 lines and tried again. What takes the time here is the reorganising after import, and this is probably what caused the file sharing lock as well. However, 600,000 lines seemed to be ok (it took 5 seconds to convert and then around 25 minutes to reorganise… you need some patience) and I got my termbase with getting on towards 10 thousand entries and the required structure to add more to it:
Statistics shown in MultiTerm
MultiTerm Termbase Definition
If you have trouble getting this far, and I think the file locking problem could vary from computer to computer, then an alternative is to create the termbase definition file using the Glossary Converter, create a new termbase in MultiTerm using that definition file, and then import the MultiTerm XML. I cover this further down because this is exactly how I intend to get more terms into the termbase.
Building on the Termbase
Now that I have the bones of the task and I understand how the TBX is structured I need to decide on a couple of things:
- Do I really want a huge termbase with all of these languages?
- How do I go about splitting up the whole TBX into bite sized chunks?
I think in reality most translators are only going to be interested in two or three languages. But I’ll take FIGS for this exercise, so French, Italian, German, Spanish and of course English (fr, it, de, es, en). So I come back to my Glossary Converter but this time I ignore all the languages apart from the ones I want. This time when I create my termbase by dragging and dropping the TBX I get a reduced termbase with only the languages I’m interested in:
337 fewer entries, but I guess this is to be expected as some terms are probably only in other languages. The coverage is not bad though, with 84% being the lowest, so that is the biggest gap I’d need to make up if I wanted to fill in the missing terms myself later. These may make up the more common languages in use at the EU, or at least the languages where people have gone to the trouble of capturing terminology.
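The drop in entries is easy to picture in code. The sketch below is purely illustrative and is not how the Glossary Converter works internally: it assumes each language lives in a `langSet` element with an `xml:lang` attribute, and `keep_languages` is a name made up here. It removes the unwanted languages from one term entry and reports whether anything is left, so an entry that only had terms in other languages drops out altogether.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
FIGS = {"en", "fr", "it", "de", "es"}


def keep_languages(entry, wanted=FIGS):
    """Remove langSet children of a termEntry whose language is not wanted.

    Returns True if the entry still has at least one language afterwards.
    """
    # findall() returns a list, so removing while looping is safe.
    for lang_set in entry.findall("langSet"):
        if lang_set.get(XML_LANG) not in wanted:
            entry.remove(lang_set)
    return entry.find("langSet") is not None
```

An entry with Czech, Polish and English would be kept with only its English term, while an entry with just Czech and Polish would return False and could be skipped.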
Splitting the TBX
But now that I have this, how do I go about splitting up the 2.2 GB TBX into bite sized chunks? Doing it manually sounds too painful for me and I’m not skilled enough to write a program to do this for me, or even a script in EditPad Pro. So I googled a little and tested a few bits of software that said they could split files, but none of them could handle the full 2.2 GB TBX as it’s simply too large. But then I came across XML Split. This little tool is not free, but I think it’s worth every penny. I could use this to split up the entire TBX file into over a hundred bite sized chunks based on a number of term entries. It works through a simple wizard that hides all the technical things it’s doing under the hood.
Select your file, tell it where the output should be and give a name for the output files. The application will number the files using the name you provide as the prefix:
You then have a choice of six different ways to split the file up, many of which may be useful for other types of tasks. But in this case the one I’m interested in is making sure that I maintain the structure of the TBX by splitting the file into groups of complete term entries. This is method 1, and the idea is that I tell it how many term entries I want in each group. I used 10000 because the import I tested to start with amounted to not far off 10000 (it was 8461 if you are struggling to match 10000 with the actual figure in my screenshot above – the FIGS termbase). I then have to tell the application what level this element is at and what its name is. I tested the level a few times with a small file and established it as three… so the application counts <martif><text><body> as the three. The name of course is <termEntry>:
The next tab has a few options, and as you can see the ones for methods 1 and 6 are there to ensure the structure is preserved, which is exactly what I want. The split files must have the header and the opening and closing elements around the term entries:
Finally I can preview the results first and check every file if I like, or I can just go ahead and split them. I did preview as I was testing this to see how it worked, but once I got my head around it I just went for the split:
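For anyone curious about what a split like this involves, here is a rough Python sketch of the same idea. It is not how XML Split works internally, and the name `split_tbx` is made up for illustration. It assumes the file is pretty-printed so that each `<termEntry …>` opening tag and each `</termEntry>` closing tag sits on its own line, treats everything before the first term entry as the header to repeat in every chunk, and closes each chunk with the same three closing elements.

```python
FOOTER = "</body>\n</text>\n</martif>\n"


def split_tbx(lines, entries_per_file):
    """Yield TBX chunks (as strings) holding up to entries_per_file term entries.

    lines is any iterable of text lines, for example an open text file.
    Assumption: termEntry opening and closing tags are on their own lines.
    """
    header, chunk = [], []
    count = 0
    seen_entry = inside = False
    for line in lines:
        if not seen_entry and "<termEntry" not in line:
            header.append(line)  # still in the part above the first entry
            continue
        seen_entry = True
        if "<termEntry" in line:
            inside = True
        if inside:
            chunk.append(line)  # lines outside entries (closing tags) are dropped
        if "</termEntry>" in line:
            inside = False
            count += 1
            if count == entries_per_file:
                yield "".join(header + chunk) + FOOTER
                chunk, count = [], 0
    if chunk:
        yield "".join(header + chunk) + FOOTER
```

Each chunk that comes out could be written to a numbered file using a prefix of your choice, much as the wizard does. Note that a plain text approach like this does not validate the XML, so it would not stop at an invalid character in the way a proper XML parser does.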
Now, I should say that first of all I didn’t get the whole file converted… it stopped at 106 TBX files. This in reality is still more than enough I think, because for my purposes 8 million terms might be a touch on the excessive side ;-) However, in the interest of completeness, and to make sure I understood all the difficulties you might encounter with a huge file like this, I spoke to the developer of this application, Bill Conniff, who was very responsive and helpful. He even downloaded and investigated the TBX file for me using some of his other very interesting products and discovered that the TBX contains illegal characters that prevent the entire TBX from being processed. The error reported was “hexadecimal value 0x03 is an invalid character”, meaning the character is outside the range allowed by the XML specification. The only characters allowed below hex 0x20 are the tab, line feed and carriage return. This tool uses Microsoft’s XmlReader object to parse the XML. It is fully XML compliant so when it reads an invalid character it reports the error and stops. This causes a problem for any tool that validates this information. Good to catch this now because I think it would almost certainly cause a problem for MultiTerm as well. If it didn’t, it would be very surprising to see MultiTerm allow that much flexibility!
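Control characters like this can be found or stripped with a small script before the file goes anywhere near an XML parser. The sketch below is one possible approach rather than what was actually done here, and the function names are made up for illustration. The pattern covers the control characters below hex 0x20 other than tab (0x09), line feed (0x0A) and carriage return (0x0D), which is the range described above.

```python
import re

# Control characters below 0x20 that XML 1.0 does not allow
# (everything except tab 0x09, line feed 0x0A and carriage return 0x0D).
INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")


def strip_invalid_xml_chars(text):
    """Return text with the disallowed control characters removed."""
    return INVALID_XML_CHARS.sub("", text)


def find_invalid_xml_chars(lines):
    """Yield (line number, column, hex code) for each disallowed character."""
    for number, line in enumerate(lines, start=1):
        for match in INVALID_XML_CHARS.finditer(line):
            yield number, match.start(), hex(ord(match.group()))
```

Running `find_invalid_xml_chars` over the open file first shows where the problems are, and `strip_invalid_xml_chars` can then be applied line by line while writing a cleaned copy. Stripping only the stray character keeps the surrounding term, although whether the term is still meaningful afterwards is something you would need to check.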
So, I removed some of the offending characters which lost me a few terms and tried again. This still created 106 TBX files in around 14 minutes, each containing around 10,000 term entries. I think cleaning the TBX to do better would take me too long, and this is enough for me as it proves the point and I’m not going to use all of these files in a MultiTerm termbase anyway; although it would be interesting to try and get them into MultiTerm Server at some point just to see if I can… a task for another day! For now I just want to move onto the next step, and that’s how do I go about adding some of these files into my existing IATE FIGS termbase.
Importing more terms
To import more terms all I have to do is convert these smaller TBX files into MultiTerm XML, and to do this I’m going to use the Glossary Converter again. The option I need is a new feature the developer added in the recent version, allowing you to convert pretty much anything related to terminology into one of the formats listed below. I check MultiTerm termbase Definition file because this will create the MultiTerm XML file I need for import, in addition to creating a MultiTerm XDT file that I could use to create a new termbase with the correct structure if the TBX conversion I used at the start had failed:
So now I can drag the second TBX file that was created (the contents of the first are already in my Termbase as I converted from TBX directly to a MultiTerm termbase) onto the Glossary Converter and then import the XML I get into my existing termbase. Don’t forget to ignore the languages you don’t want!
The conversion to XML and XDT takes a few seconds for each file, so this process is incredibly fast. I did the next two files and just deleted the XDT files so I only had the XML files I needed. Importing them into MultiTerm is simplicity itself: just open your termbase and go to the Termbase Management view (called Catalogue in MT 2011 and earlier) – I’m using MT 2014. Then proceed as follows:
- Select “Import”
- Select the “Default import definition” (unless you have something clever planned!)
- Click on Process and select the first XML file for import
When you do this you have an option to use the Fast Import or not. I tried to use the Fast Import and a normal import without reorganisation because the reorganisation takes ages. But unfortunately even if you click the box to say don’t do a reorganisation it will do one anyway! (Bug is logged!) The import of each file based on the sizes I have, 10,000 term entries per file, takes around ten minutes per file on my laptop. So after importing these additional four files I now have 28,281 terms in my FIGS termbase distributed like this:
Just to wrap up this rather long article, long but hopefully not too complicated, here’s a step by step guide to the process I went through:
- Download the IATE TBX file
- Split up the IATE TBX into bite sized chunks
- Create your termbase with the definition you want (how many languages, which fields)
- Convert as many of the bite sized TBX files to MultiTerm XML as you like
- Import the MultiTerm XML to your termbase
So pretty simple really, and I hope this also sheds some light on why you can’t just take the whopping 2.2 GB TBX containing around 8 million terms in 24 official EU languages and convert it in one go to a MultiTerm termbase. While working through all of this I did have a little play with all 24 languages and have so far created a termbase with 129,965 entries. MultiTerm seems to be coping quite well so far, although I don’t know how many more I’ll try and add since the termbase itself is already 1.5 GB in size… but the language coverage is interesting:
I also recorded a short video so you can see what this looks like in Studio 2014 and get an idea of how performant it is so far: