I love this cartoon with the husband and wife fishing on a calm weekend off.
“Honey, I got a big one on!”
She’s hooked a whopper and he casually responds in the way he always does when she occasionally catches a fish on Sunday morning.
“Yes dear, uh huh…”
The equipment they’ve got, from the boat to the fishing rods, is all perfectly suitable for their usual weekend activities but hopelessly inadequate for handling something like this! Little do they know that the whopper under the surface is going to give them a little more trouble when they try to bring him on board!
The analogy was brought home to me this week when I discovered that IATE (InterActive Terminology for Europe) had made a download available for the EU's inter-institutional terminology database. You can go and get a copy right now from here – Download IATE. The file is only 118 MB zipped, so you can import it straight into your favourite CAT and use it as a glossary to help your productivity!
Great idea… but unzipped it's a 2.2 GB TBX file containing around 8 million terms in 24 official EU languages. If you even have a text editor capable of opening a file like this (common favourites like Notepad++ can't), you'll see that this equates to over 60 million lines and goodness knows how many XML nodes. That's quite a whopper, and you're never going to be able to handle this in MultiTerm or any other desktop translation environment without tackling it in bite-sized chunks.
But before reaching for your keyboard to find a better text editor, you should also give some thought to what you want from this TBX. Do you really want all 24 EU languages in the file, or do you just want two, three or four to help you with your work? If you only want to work with some of the languages then you have another problem: the specialist translation or terminology tools you have are probably unlikely to be able to handle this file either… at least not without chopping it into something smaller too. So you have a couple of operations to handle here, and they are going to involve handling files of a substantial size.
My Suggestion
I decided to have a go at tackling this TBX after seeing a few comments from users who had tried it and failed. Was it really impossible? To test it I cut out a section of the TBX to work with, and I had a couple of goes at this to get the size right for my tools, and for my patience. Patience, because when you ask your tools to handle files like this it can involve a lot of waiting, and in some cases a lot of waiting that never succeeds at all! So the first additional tool I used was EditPad Pro, a text editor I already own. It was very pleasing to find I could open the TBX completely with this and browse the entire contents. So thumbs up for the developers, the same people who created RegexBuddy, another tool I mention from time to time. Cutting out the bit I wanted was easy, because the TBX file is structured like this.
The header: this part always needs to be at the top of every file you recreate.
The overall structure: I don't know what you'd call this part, but below the header the entire contents sit within two elements, <text> and <body>, and the whole thing is finally closed off with the </martif> element at the end:
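Stripped right down, and leaving out the detail of the real header, the shape is roughly this (a sketch, not a copy of the actual file):
<?xml version="1.0" encoding="UTF-8"?>
<martif type="TBX">
  <martifHeader>
    <!-- file description, encoding information and so on -->
  </martifHeader>
  <text>
    <body>
      <!-- all the termEntry elements, millions of them -->
    </body>
  </text>
</martif>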
The terminology itself sits inside elements called termEntry. Each of these has a unique ID, some other fields, and the term itself in at least one language, sometimes in all 24. The structure for a single entry might look like the sketch below: this one has two languages, Czech and Polish, with a unique ID of "IATE-127", a subject field at entry level, and two data fields at term level, "termType" and "reliabilityCode", which are some of the fields IATE use to help structure their terminology (this is explained on the IATE website).
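Sketched from that description rather than copied from the file, so the actual terms and field values here are just placeholders, a single entry looks something like this:
<termEntry id="IATE-127">
  <descripGrp>
    <descrip type="subjectField">…</descrip>
  </descripGrp>
  <langSet xml:lang="cs">
    <tig>
      <term>…</term>
      <termNote type="termType">fullForm</termNote>
      <descrip type="reliabilityCode">3</descrip>
    </tig>
  </langSet>
  <langSet xml:lang="pl">
    <tig>
      <term>…</term>
      <termNote type="termType">fullForm</termNote>
      <descrip type="reliabilityCode">3</descrip>
    </tig>
  </langSet>
</termEntry>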
As long as you keep complete termEntry elements, and the information within them, you can remove as many as you like to create the desired file size for import. The downloaded TBX itself contains 60,587,834 lines. After a few tests I found that a comfortable size to work with was around 1,000,000 lines. So for my first test I manually removed everything in the TBX below line number one million… ish. I saved this file with a new name and then tackled how to get it into MultiTerm!
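I did the cutting by hand in EditPad Pro, but just as an illustration, the same idea sketched at a command line might look like this (assuming a Unix-like shell and the file name used by the June 2014 IATE export; a sketch, not the method I actually used):
$ head -n 1000000 iate_export_25062014.tbx > sample_raw.tbx
$ last=$(grep -n '</termEntry>' sample_raw.tbx | tail -n 1 | cut -d: -f1)   # line of the last complete entry
$ head -n "$last" sample_raw.tbx > sample.tbx
$ printf '</body>\n</text>\n</martif>\n' >> sample.tbx   # close the document again
The last line just puts the closing elements back so the sample is still valid XML.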
Structuring your Termbase
This link on the IATE website gives you enough information to manually create a termbase definition for MultiTerm to match the way the TBX is structured. However, I'm naturally suspicious of these things, so I decided that one million lines would be a decent sample to extract the definition from, and then I could be sure I had the same structure as the one actually used in the TBX. To do this I didn't use MultiTerm Convert. Instead I used the new version of the Glossary Converter, as this has the ability to convert pretty much anything to anything and it can handle a TBX file pretty easily. All I do is drag the TBX onto the Glossary Converter and I am presented with this:
So my one million line sample file has terms for all 24 languages in it, plus two more. One is "la", or Latin, which is not supported in MultiTerm anyway, and the other is "mul", or Multilingual. I hadn't come across this abbreviation being used before, and as it's non-specific it's not interesting for me either. So I ignore both of these like this (just click the Ignore on/off button shown in the image above):
I also get four data fields:
- subject field (set at entry level in MultiTerm)
- termType (set as a term field)
- reliabilityCode (set as a term field)
- administrativeStatus (set as a term field)
If you take a look at the structure proposed in the links to the IATE website you will see that these are actually a little different. So if I had created my definition based on the website, rather than allowing the Glossary Converter to show me what was actually used, I might well have got it wrong!
Now, the Glossary Converter has to parse the entire file to see what's in it and then report it in a meaningful way, so you can begin to see why it's a problem to handle files much larger than my sample. Having said this, I wrote this article as I played with the file, and just as I got to the stage of seeing what was in the termbase I was expecting, I was presented with an error: "File sharing lock count exceeded. Increase MaxLocksPerFile registry entry". What the heck does that mean?? Fortunately the Glossary Converter help file, under known issues, gives me the answer. It seems to be a bug in either MultiTerm or Microsoft Access (no clear pointers to which one), and there is a link to a Microsoft KB article, KB#815281. So I could either try to apply the fix and see if it works, or reduce the size of my TBX and try again. I figured it would be faster to reduce the size and make sure I didn't break anything else, so I cut it down by 40% to around 600,000 lines and tried again. Certainly what takes the time here is the reorganising after import, and this is probably what caused the file sharing lock as well. However, 600,000 lines seemed to be ok (despite it taking 5 seconds to convert and then around 25 minutes to reorganise… you need some patience) and I got my termbase with getting on for 10,000 entries and the required structure to add more to it:
Statistics shown in MultiTerm
MultiTerm Termbase Definition
If you have trouble getting this far (I think the file locking problem could vary from computer to computer), an alternative is to create the termbase definition file using the Glossary Converter, create a new termbase in MultiTerm using that definition file, and then import the MultiTerm XML. I cover this further down, because it's exactly how I intend to get more terms into the termbase.
Building on the Termbase
Now that I have the bones of the task and I understand how the TBX is structured I need to decide on a couple of things:
- Do I really want a huge termbase with all of these languages?
- How do I go about splitting up the whole TBX into bite-sized chunks?
I think in reality most translators are only going to be interested in two or three languages. But I’ll take FIGS for this exercise, so French, Italian, German, Spanish and of course English (fr, it, de, es, en). So I come back to my Glossary Converter but this time I ignore all the languages apart from the ones I want. This time when I create my termbase by dragging and dropping the TBX I get a reduced termbase with only the languages I’m interested in:
337 fewer entries, but I guess this is to be expected as some terms are probably only in other languages. The coverage is not bad though, with 84% being the lowest, so that's the largest gap I'd need to make up if I wanted to fill things in myself later. These are probably the more common languages in use at the EU, or at least the languages where people have gone to the trouble of capturing terminology.
Splitting the TBX
But now I have this, how do I go about splitting up the 2.2 GB TBX into bite-sized chunks? Doing it manually sounds too painful, and I'm not skilled enough to write a program to do it for me, or even a script in EditPad Pro. So I googled a little and tested a few bits of software that claimed they could split files, but none of them could handle the full 2.2 GB TBX; it's simply too large. But then I came across XML Split. This little tool is not free, but I think it's worth every penny. I could use it to split the entire TBX file into over a hundred bite-sized chunks based on a number of term entries. The way it works is this, using a simple wizard that hides all the technical things it's doing under the hood.
Select your file, tell it where the output should be and give a name for the output files. The application will number the files using the name you provide as the prefix:
You then have a choice of six different ways to split the file up, many of which may be useful for other types of tasks. But in this case the one I'm interested in is making sure that I maintain the structure of the TBX by splitting the file into groups of complete term entries. This is method 1, and the idea is that I tell it how many term entries I want in each group. I used 10,000 because the import I tested to start with amounted to not far off 10,000 (it was 8,461 if you are struggling to match 10,000 with the actual figure in my screenshot above – the FIGS termbase). I then have to tell the application what level this element sits at and what its name is. I tested the level a few times with a small file and established it as three… so the application counts <martif><text><body> as the three. The name of course is <termEntry>:
The next tab has a few options, but as you can see the ones for methods 1 and 6 are there to ensure the structure is preserved, and this is exactly what I want. The split files must have the header and the opening and closing elements around the term entries:
Finally I can preview the results first and check every file if I like, or I can just go ahead and split them. I did preview as I was testing this to see how it worked, but once I got my head around it I just went for the split:
Now, I should say first of all that I didn't get the whole file converted… it stopped at 106 TBX files. In reality this is still more than enough I think, because for my purposes 8 million terms might be a touch on the excessive side 😉 However, in the interest of completeness, and to make sure I understood all the difficulties you might encounter with a huge file like this, I spoke to the developer of this application, Bill Conniff, who was very responsive and helpful. He even downloaded and investigated the TBX file for me using some of his other very interesting products, and discovered that the TBX contains illegal characters that prevent the entire file from being processed. The error reported, "hexadecimal value 0x03 is an invalid character", refers to a character outside the range allowed by the XML specification: the only characters allowed below hex 0x20 are the tab, line feed and carriage return. XML Split uses Microsoft's XmlReader object to parse the XML. It is fully XML compliant, so when it reads an invalid character it reports the error and stops. This causes a problem for any tool that validates this information. Good to catch this now, because it would almost certainly cause a problem for MultiTerm as well I think. If it doesn't, it would certainly be very surprising to see MultiTerm allow that much flexibility!
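Incidentally, if you would rather clean the whole file up front than lose terms, one rough approach (a sketch assuming a Unix-like shell with GNU tr, and assuming the only problem really is stray control characters) is to delete everything the XML 1.0 specification forbids, i.e. everything below 0x20 except tab, line feed and carriage return:
$ tr -d '\000-\010\013\014\016-\037' < iate_export_25062014.tbx > iate_cleaned.tbx
I haven't verified this on the full export, so treat it as a starting point rather than a recipe.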
In the end I removed some of the offending characters, which lost me a few terms, and tried again. This still created 106 TBX files in around 14 minutes, each containing around 10,000 term entries. I think cleaning the TBX to do better would take me too long, and this is enough for me as it proves the point; I'm not going to use all of these files in a MultiTerm termbase anyway, although it would be interesting to try to get them into MultiTerm Server at some point just to see if I can… a task for another day! For now I just want to move on to the next step: how do I go about adding some of these files to my existing IATE FIGS termbase?
Importing more terms
To import more terms all I have to do is convert these smaller TBX files into MultiTerm XML, and to do this I'm going to use the Glossary Converter again. The option I need is this one, a new feature the developer added in the recent version that allows you to convert pretty much anything related to terminology in the list below into any of the other formats listed. I check MultiTerm termbase Definition file because this will create the MultiTerm XML file I need for import, in addition to a MultiTerm XDT file that I could use to create a new termbase with the correct structure if the TBX conversion I used at the start had failed:
So now I can drag the second TBX file that was created (the contents of the first are already in my termbase, as I converted from TBX directly to a MultiTerm termbase) onto the Glossary Converter, and then import the XML I get into my existing termbase. Don't forget to ignore the languages you don't want!
The conversion to XML and XDT takes a few seconds for each file, so this process is incredibly fast. I did the next two files and just deleted the XDT files so I only had the XML files I needed. Importing them into MultiTerm is simplicity itself: just open your termbase and go to the Termbase Management view (called Catalogue in MT 2011 and earlier) – I'm using MT 2014. Then proceed as follows:
- Select “Import”
- Select the “Default import definition” (unless you have something clever planned!)
- Click on Process and select the first XML file for import
When you do this you have an option to use the Fast Import or not. I tried the Fast Import and also a normal import without reorganisation, because the reorganisation takes ages. But unfortunately, even if you tick the box to say you don't want a reorganisation, it will do one anyway! (Bug is logged!) The import of each file, at the sizes I have (10,000 term entries per file), takes around ten minutes per file on my laptop. So after importing these additional four files I now have 28,281 terms in my FIGS termbase, distributed like this:
Just to wrap up this rather long article, long but hopefully not too complicated, here’s a step by step guide to the process I went through:
- Download the IATE TBX file
- Split up the IATE TBX into bite-sized chunks
- Create your termbase with the definition you want (how many languages, which fields)
- Convert as many of the bite-sized TBX files to MultiTerm XML as you like
- Import the MultiTerm XML into your termbase
So pretty simple really, and I hope this also sheds some light on why you can't just take the whopping 2.2 GB TBX containing around 8 million terms in 24 official EU languages and convert it in one go into a MultiTerm termbase. In the process of all this I did have a little play with all 24 languages, and have so far created a termbase with 129,965 entries. MultiTerm seems to be coping quite well so far, although I don't know how many more I'll try to add since the termbase itself is already 1.5 GB in size… but the language coverage is interesting:
I also recorded a short video so you can see what this looks like in Studio 2014 and get an idea of how performant it is so far:
Hi Paul,
Very thorough and useful, as usual.
For those who are only interested in extracting their own language pair, an alternative is using Xbench. Xbench 3.0 64 bit is able to load the entire 2 GB file (you’ll have to wait several minutes before the list of languages shows). Then you select your source and target language, and Xbench finishes loading the file. Once your language pair is loaded Xbench easily allows exporting the language pair as a TMX memory.
Riccardo
Interesting, Riccardo, especially if you want a TM from this. Although whether you want the TM or a termbase, you'd still have to break the TMX down into bite-sized chunks and convert it. What did you do with this?
Hi Paul,
Xbench (the 64-bit version) was able to convert the TBX file to TMX without the need for breaking it up in smaller chunks.
It was just a question of selecting the desired language pair, and then let Xbench load the file and export the results to TMX format.
So far I haven't imported the TMX into Studio, but I have heard that a straight load into a TM resulted in a large number of import errors… I suspect that may be due to segments where either the source or the target is empty.
I’ll document the whole process and post it on my blog, as soon as I have the time.
Hi Riccardo, so another TM use case and not terminology. This is quite interesting especially as there will be no translation logic behind the choice of which synonyms to pair up… and there are a lot of them.
You’re probably right on the errors as there will be a large number of terms without a matching target, or source. Bad for a TM but good for a Termbase where you can add missing translations later.
It sounds clear that the source file is bad: 8-bit ASCII strings must have been copied directly into the XML elements, which claim to hold UTF-8 data. This is wrong; obviously they need to be converted into UTF-8. Every tool has to deal with the bad data arbitrarily, e.g. throw out some characters, throw out a text string, or just refuse to handle it because the XML is lying about the encoding.
You guys know that now, but might I suggest that one of the tool owners get back to the creator/owner and tell them they screwed it up and need to recreate their data file? I know that you are just trying to work with the tools at your disposal to explore this, but the bad data really needs to be fixed at the source if they expect any tools to use this. It should be a very simple programming or scripting job to recreate this correctly.
I just received a very helpful email from the XML Split developer which I’m going to paraphrase here and I’ll also amend my post to clarify.
The error reported, "hexadecimal value 0x03 is an invalid character", actually has nothing to do with the encoding of the file. The character is outside the range allowed by the XML specification: the only characters allowed below hex 0x20 are the tab, line feed and carriage return. They use Microsoft's XmlReader object to parse the XML. It is fully XML compliant, so when it reads an invalid character it reports the error and stops. The specification states it should not continue reading after an error occurs.
The statement about utf-8 was a little ambiguous. He also references a useful article on his blog that lists the valid ranges: Removing Illegal Characters in XML Documents
One suggestion he had, if you don't already own an XML editor, is to use XMLMax. It would have reported the error because it uses XmlReader, and it offers an option to fix it, which opens the file in the text editor positioned at the error, or close enough to it that you can see it and resolve it.
Thanks Paul,
One question, when you talked about non-Latin characters in the TBX file: UTF-8 uses the same 1-byte codes as ASCII for 7-bit characters. The non-Latin (Cyrillic, Greek, Hebrew, etc.) 8-bit codes are translated into 2-byte characters in UTF-8. Were you saying that the UTF-8 2-byte codes for non-Latin characters were incorrect in the original TBX file, or that some of the tools do not correctly handle UTF-8? If the data is correctly UTF-8 encoded then 7-bit characters are 1 byte and are identical in UTF-8 and ASCII, 8-bit characters become 2 bytes, and CJK Oriental language characters are 3 bytes (fwiw, these are usually 2 bytes in most pre-Unicode, "oriental ASCII" type character sets like MS Shift-JIS Japanese often used in Windows).
If a tool treats characters as ASCII and not as UTF-8, then it should not in the general case be fed UTF-8 encoded data, although it would work quite correctly on a UTF-8 encoded file as long as the data was only 7-bit ASCII, because those characters transfer 1-1 directly to UTF-8 as identical bytes. But the same tool would not handle UTF-8 data containing non-Latin characters or CJK characters. I'm not sure where you are saying the error lies: in the data, in some of the programs you used, or in trying to process UTF-8 data using programs that were written to only handle ASCII.
What you wrote is ambiguous where you say: “Since it is defined as UTF-8 this causes a problem for any tool that validates this information. “
I probably didn’t explain this as well as I should, but there are characters in the TBX like this for example (hopefully they display here):
<term>protocolo de alteração</term>
These are illegal characters in XML. So the XML Split tool reports "hexadecimal value 0x03 is an invalid character" and stops. I believe the same thing happens if you try to import the TBX into memoQ too. I also tried a TBX file with this in it in the Glossary Converter, and that failed as well.
I reckon the content was screwed up before it went into the TBX, but in order to do something with the whole file from IATE you would have to find these illegal characters first and correct or remove them.
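A rough way to locate them yourself first, for example with GNU grep (just a sketch; -P gives you Perl-style \xNN escapes and -a forces grep to treat the file as text rather than binary):
$ grep -a -n -P '[\x00-\x08\x0B\x0C\x0E-\x1F]' iate_export_25062014.tbx | head
That lists the first few offending line numbers so you can inspect them and then fix or delete them in an editor.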
Sorry, my very bad! I totally confused the names. Too many versions of Trados with different names confuse me. I did not recognize your video as the Trados 2014 editor because I've only used Trados 2011 and the look is so different. After a quick look I actually thought it was a screen from MultiTerm, which I have not even used yet. The Studio 2011 editor is what I was actually trying to ask about. I now realize that "Workbench" is a specific old product. Too bad, because personally I really prefer the name "workbench", as a great descriptive name for a craft person's custom work environment, to "Studio Editor". But I caused nothing but confusion, and hence in future I'll try to stick with the proper historic terminology.
Hi Paul, Did you try Xbench before embarking on your journey? Xbench can import the IATE TBX in around 2-3 minutes, presents you with a nice dialogue where you can choose the (2) languages you want, and can export as a TMX, tabbed UTF-8 .txt file and Excel file. Not yet sure what metadata you lose though. I’m currently testing this. By the way, we have been discussing this here: http://www.proz.com/forum/translator_resources/271879-part_of_the_iate_database_can_now_be_downloaded_as_a_massive_tbx.html + https://groups.google.com/forum/?fromgroups=#!topic/cafetranslators/lAopgfpC1Sw
Hi Michael, Riccardo asked me something similar and I have to admit I don't know why this would be helpful for me. Xbench will only allow me one language pair, whereas the termbases I created were for all 24 languages together, plus a FIGS example with 5 of the 24 languages available to me. Xbench strips all the synonyms, and if you look at the TBX in MultiTerm you will see there are a lot of these. So how does it know which pair of translations go together? Notwithstanding this, the alternatives are very useful and good to have available. Xbench also strips all the metadata, so it's not helpful if that was important to me.
IATE is a terminology database and not a translation memory. So whilst it might have some value as a TMX, especially for some tools that use this format for a glossary as well (if they do?), I can see no logical reason to use Xbench at all. Even if all of the above was resolved and the TMX contained everything the TBX did. Now what? I have a 2.2Gb TMX and still no way to reflect this as a terminology database in a tool that is designed to handle terminology properly?
I’d be interested in your opinion and that of Riccardo though because converting termbases to translation memories comes up a lot. But this is not what I’m interested in… I want a proper termbase.
Oh yes… I tried to have a play with Xbench on the train and discovered it checks my license. Without an internet connection I had to wait till I reached the airport… quite frustrating! But I do like Xbench, it just doesn’t have a use for me in this scenario.
By the way, how does the size of this termbase compare to what Trados workbench has been tested with? This could be a very good stress test if you load all the language pairs into a single TB. I don't have any experience with anything large, but it seems to me that for practical use, unless Trados scales very gracefully, most translators would probably be better off building a termbase with only a single language pair or a few of them. But I'm sure you know about pushing Trados to its limits and I'm curious.
Will trados have to load the entire termbase into memory or can it leave it on disk? Windows does in theory support virtual memory in the OS, however in practice I don’t want to try that because in my experience Windows gets totally bogged down to the point of being almost dead when physical memory is overcommitted. I have 6GB of physical memory and that’s fairly common now, but I usually don’t have as much as 2.2GB free.
In either case do you expect a search of such a large termbase to happen in less than perceivable time, or with a tiny delay, or a noticeable delay? This is not a practical problem for me now, but it's good to know whether I can use a massive termbase and/or TM with no discomfort should I need to.
I’ve no idea… but Trados Workbench is irrelevant here surely? We’re talking about MultiTerm. But generally you’re right. I think for most translators interested in this the exercise is to extract the languages they want and just work with these as I discussed in the article. How well MultiTerm will respond in practice I’m not sure as I’ve never put something this size into MultiTerm in its entirety before. But out of interest I tried it with the big termbase I created so far, so the one with 24 languages and around 130,000 entries. You can find a short video here if you’re interested to see what it looks like : IATE in Studio 2014.
Hi Paul,
As I said in my previous comment, Xbench can export the IATE .tbx file as: TMX, tabbed UTF-8 .txt file and Excel file. It’s only the tab-delimited text file I am interested in, as this is (one of) the termbase format(s) in CafeTran, and can be used to import into a termbase in pretty much any CAT tool. I never said I was going to use the database as a translation memory.
And yes, CafeTran has two kinds of termbases: these can be tab-delimited UTF-8 text files, or TMXs. The user is free to choose one. They each have their own specific benefits and CT users usually strongly believe in using only one of them, never both.
Also, I only translate from one language into one other language, so all I need is two languages. I’m a translator, not an agency.
I agree, though, the fact that Xbench doesn’t export all of the metadata – which I hinted at in my previous comment and which we have been discussing on Proz and in the CafeTran Google Group list – is a problem. I’m going to contact the Xbench people to see if they can do something about this. Igor (CafeTran’s developer) is also looking into this, to see if he can get CT to import the file properly. Currently, CT is stalling when importing the file, most likely due to some kind of error in the XML.
Hi Michael, I'm not very familiar with CafeTran or its glossaries, but I see where you're coming from. I think having the multilingual capability in a proper terminology tool is still useful, even for freelance translators. I know many with more than just a single language pair, and many more who work with others and maintain a multilingual capability.
I added a video to the end of the blog as I was answering a question over performance to another comment. I think it shows off quite nicely the capability of MultiTerm as a terminology tool that can take advantage of all the information in the TBX.
Just watched your video and that is pretty impressive.
Not sure how they compare, but my main CafeTran termbase contains around 350,000 entries (lines), many of them with quite a bit of metadata, and lookups are instantaneous while translating. My termbase is actually just a tab-delimited UTF-8 text file (which I have saved as a .csv, so it will auto-open in my CSV editor rather than in EmEditor), which is 25MB on disk. To edit it while in CafeTran, I can just right-click on my termbase pane and select ‘Edit’ and it will open up in Ron’s Editor (my CSV editor), where I can directly edit the file in a ‘columnar’ UI similar to Excel. I can also edit individual entries in the ‘Quick Term Editor’.
I just sent the following question to Xbench support:
————————-*
Hello,
I have been trying to import the recently downloadable IATE database, which can now be downloaded as a TBX from the IATE site, and noticed that Xbench isn’t importing the file properly. It is missing a lot of metadata and not at all seeing the way the synonyms are related. Many people are currently discussing this issue, e.g., here:
• http://multifarious.filkin.com/2014/07/13/what-a-whopper/
• http://www.proz.com/forum/translator_resources/271879-part_of_the_iate_database_can_now_be_downloaded_as_a_massive_tbx.html
• https://groups.google.com/forum/?fromgroups=#!topic/cafetranslators/WEoiqacrpo0
• https://www.youtube.com/watch?v=xDv-y0p0NXs&feature=youtu.be
I think it would be great if Xbench was changed so that it would be able to correctly import this file, and files of its kind. This would be a ‘unique selling point’ for Xbench, as there is currently no other (single) program that can do this.
Michael
————————-*
By the way, do you happen to know of a decent TBX editor? It seems that the EU and several other large organisations (see e.g. the ‘Microsoft Glossaries’) have started taking standards seriously, and have all decided to use TBX as their terminology format. I think it’s about time someone created a good TBX editor. I wonder if (the now dead) Heartsome Dictionary Editor could do this?
Michael
Paul, I thought that the normal translation workflow involves using the termbase within the translation editor, and that Multiterm is used to create the termbase only. As such Multiterm performance seems less critical to me, as I would use MT once but then use the termbase day in, day out in the workbench. So the larger concern is how a very large termbase will perform if I use it in translation. Maybe I'm misunderstanding the workflow of using Multiterm during translation, or maybe I confused you by asking the one-step-removed question which occurred to me when I saw that you were trying to create such a large termbase.
I’d have 2 concerns about such a large DB during translation — is the access super fast? Do I need to have enough main memory available to load the entire termbase in order to use it? Sorry for going a bit off-topic but I thought you might know the answers off the top of your head.
It does… I think you confused me bringing an old non-supported product into the equation. I recorded the video showing the Studio Editor and how the termbase I have created so far performs. I certainly won’t be testing this with Trados Workbench.
@Paul, if you need an emergency Internet connection for the Xbench license check when on a train, one thing you can do if you have a smartphone at hand is to use it as a Wifi hotspot. A basic GPRS link should be enough for the sign-in transaction. Once the license check is done, you can turn your smartphone off, and Xbench will work in offline mode for a few days. However, we still have to figure out the case where the user gets an Xbench license check while scuba diving, and it is a major issue because the Summer is already here 🙂
Indeed… and if I’d had a GPRS connection for long enough on that route to do this I surely would! But it’s good to know about the offline mode thing. Why won’t it do this by default anyway? Would save the hassle.
The scuba diving usecase is certainly going to be a problem!
Interesting developments. Xbench support just answered me:
————————————————*
Hi Michael,
Thank you for your email. We published a new build of Xbench 3.0, which handles synonyms correctly. The 64-bit version is required to load the huge IATE .tbx file.
Download and install Xbench 3.0 build 1243 (64-bit). It is available at http://www.xbench.net/index.php/download.
It takes some time for Xbench to look through the file and show the languages available in the IATE .tbx file in the "Select Languages" window.
I have attached a screenshot of one term (EN > ES), with synonyms and metadata. If some metadata is missing, please send us an example so that we can reproduce this issue.
Regards,
Oscar Martin,
The Xbench Team
————————————————*
Haven’t tried it yet as I am away from the office, but this looks promising!
As was mentioned earlier, there's more than one way to skin a cat. Here's how someone did it for their preferred CAT on Linux.
”
As you may know, the IATE database was published in June in TBX format.
This is the story of how I tackled it on Linux.
Objective:
1. Keep the TBX format as OmegaT can handle it.
2. Create bilingual files as OmegaT reads just the languages defined in the project properties.
3. Make the files as small as possible.
Let’s start with downloading (into a dedicated directory so that other files don’t interfere when working with it)
$ wget -O iate_download.zip http://iate.europa.eu/downloadTbx.do
Unpack:
$ unzip -L iate_download.zip
Archive: iate_download.zip
inflating: iate_export_25062014.tbx
I don’t like caps, so I used the -L switch.
What we got:
$ ls -sh
2.1G iate_export_25062014.tbx
As you can see the database is big… some say it’s huge. Just to open it in an editor could be a problem with a dual core 2100 MHz CPU and 4 GB RAM.
I decided to split the file into smaller ones to make the work easier.
The content and structure of the TBX file are described on the IATE pages:
http://iate.europa.eu/tbxPageDownload.do
http://iate.europa.eu/tbx/IATE%20Data%20Fields%20Explaind.htm
You can examine the structure with xmlstarlet too.
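For example, something like this lists the unique element paths, so you can see how everything nests (it will take a while on a file this size):
$ xmlstarlet el -u iate_export_25062014.tbx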
When I got an idea of what the internals look like, I chose to extract all term entries into separate files in a subdirectory, and the tool of choice was csplit:
$ csplit -k -n 7 -f splits/iate_split_ iate_export_25062014.tbx '/<termEntry/' {*}
Options explained:
-k = do not remove output files on errors
-n 7 = every file gets a number appended; 2 digits are used by default, but I knew I would need more
-f = prefix, or base file name; you can skip it and get the default 'xx'
The split took 15 minutes. The amount of data is big, so using a machine like mine implies a lot of waiting… and then much more waiting.
Let's look at the number of files we got:
$ ls -1 splits/ | wc -l
1341626
one million three hundred forty-one thousand six hundred twenty-six
Also this number will cause issues.
Now we need to keep the first file and the last 3 lines of the last one – they are the header of the TBX file and the closing tags.
$ ls -1 splits/ | head -n 1
iate_split_0000000
$ ls -1 splits/ | tail -n 1
iate_split_1341625
I copied those files to another directory, edited the last one and renamed it to iate_split_1341626.
Now the fun part. My target language is Slovak, so I only need to keep the entries containing Slovak and delete the rest.
The first attempt was simple, but failed:
$ rm $(grep -L "sk" splits/*)
bash: /bin/grep: Argument list too long
rm: missing operand
Try 'rm --help' for more information.
My solution was to read the file names into an array and feed grep in chunks:
I wrote a script and ran it:
#!/bin/bash
files=(/home/user/iate/splits/*)
for ((i=0; i<${#files[*]}; i+=100)); do
rm -v $(grep -L "sk" "${files[@]:i:100}")
done
This deletes all files that don't contain my target language – Slovak ("sk").
But it takes painfully long – 12 hours. Luckily the process doesn't take that many resources and I was able to work with the computer while it ran in the background. A good solution is to run it in a terminal multiplexer such as screen or tmux. Thanks to the -v switch I was able to check the progress.
After the long 12 hours I ended up with 28,692 files, i.e. term entries.
Let's make a backup of those, so I don't need to wait so long again.
So now I have the term entries I want to keep, but they still contain languages I'm not interested in. Let's get rid of them:
In this case I will keep only German ("de"), multilingual ("mul") and Slovak of course 🙂
$ cd splits/
To work in the subdirectory with the splits.
$ for i in *; do xmlstarlet ed -L -O -d '/termEntry/langSet[@xml:lang="bg" or @xml:lang="cs" or @xml:lang="da" or @xml:lang="el" or @xml:lang="en" or @xml:lang="es" or @xml:lang="et" or @xml:lang="fi" or @xml:lang="fr" or @xml:lang="ga" or @xml:lang="hr" or @xml:lang="hu" or @xml:lang="it" or @xml:lang="la" or @xml:lang="lt" or @xml:lang="lv" or @xml:lang="mt" or @xml:lang="nl" or @xml:lang="pl" or @xml:lang="pt" or @xml:lang="ro" or @xml:lang="sl"]' $i; done
xmlstarlet is a tool which can be used to transform, query, validate, and edit XML documents and it is really handy.
This command edits each file in place (-L), omits the XML declaration that the tool would normally insert (-O), and deletes (-d) the langSet elements matched by the XPath expression.
It gave some errors; I identified the files:
iate_split_0442688:177
iate_split_1331208:38
iate_split_1341625:13
checked them
$ sed -n 177p iate_split_0442688
„prea mare pentru a se prăbuși”
$ sed -n 38p iate_split_1331208
Σχέδιο δράσης της ΕΕ για το ρόλο της ισότητας των φύλων και της χειραφέτησης των γυναικών στην ανάπτυξη
$ sed -n 13p iate_split_1341625
and realised it's some sort of false encoding in 2 files, in parts I won't keep anyway. If you want to keep the Greek or Romanian terms you will need to correct them; I just deleted them:
$ sed -i '38d' iate_split_1331208
$ sed -i '177d' iate_split_0442688
And the third file was the last one with “trailing” tags, so I deleted those:
$ sed -i '13,$d' iate_split_1341625
I ran the previous xmlstarlet command on those three files too and had almost reached my goal.
But there were still those multilingual entries, which OmegaT will ignore in a DE-SK project.
I decided to make those into source language entries, so I converted the "mul" declaration into "de".
$ sed -i 's/"mul"/"de"/' *
I also have no use for entries containing only Slovak terms:
$ rm $(grep -rL '"de"' *)
Let’s join the remaining files now:
First you need to copy the preserved first and last files – the header and the closing tags – back into the subfolder with all the splits, then all you need is:
$ cat * > iate_de_sk
I realized that most or all of the metadata are not of use for me, so I used xmlstarlet again to remove those parts from the TBX:
$ xmlstarlet ed -L -O -d '/martif/text/body/termEntry/descripGrp' iate_de_sk
$ xmlstarlet ed -L -O -d '/martif/text/body/termEntry/langSet/tig/termNote' iate_de_sk
$ xmlstarlet ed -L -O -d '/martif/text/body/termEntry/langSet/tig/descrip' iate_de_sk
The file was now 8.7 MB, still relatively big, but it contained a lot of whitespace.
I stripped it:
$ sed -i -e 's/^[ ]*//' -e 's/[ ]*$//' iate_de_sk
and the newlines as well:
$ tr -d '\n' < iate_de_sk > iate_de_sk_2
and ended with 5.4 MB.
It looked great, just the encoding was not the right one, but it was easy to fix:
xmlstarlet fo -o -e utf-8 iate_de_sk_2 > iate_de_sk.tbx
Did the same to get the glossaries for my other working languages and now I have perfect working IATE TBX files for use in OmegaT. Goal accomplished 🙂
Maybe I did it in a little bit of a complicated way… Do you have a better way?
Posted by: ymilos@yahoo.com
“
Hi Paul,
Hey, can you give me a link to that FIGS-only version of the TBX you made, and/or just a Dutch-English one? I am playing around with opening the original IATE TBX in Heartsome Translation Studio, but can't remember where you posted them. The original is too big to be opened in HTS.
Michael
Oops, sorry, found it already: http://multifarious.filkin.com/2014/07/22/a-few-bilingual-tbx-resources/
The glossary converter link is dead.
Thanks Piotr… should be good now. The Charity Edition is no more and now it’s part of the free version. So I changed the link to the latest version.