A few bilingual TBX resources

Since writing my last article on handling large TBX files I have extracted a few language pairs from the very large TBX provided by IATE, and thought I would share them here for others to use.  If you want a specific language pair from the 25 languages within the IATE TBX then drop a note into the comments.  I can’t guarantee I’ll do it quickly, but as the process is fairly straightforward I will add them from time to time.

All of the files below are extracted from the following original: Download IATE, European Union, [2014]

Also note that because of the nature of this TBX not all languages are equally well populated.  There will be many more English terms in the TBX than anything else, so some entries in an extracted pair contain a term in only one of the two languages.  This gives you two options:

  1. Import the TBX into your favourite tool and complete the missing terms for your own personal use; or
  2. Remove the monolingual entries before import so you have a 100% populated termbase (straightforward if you convert the TBX to Excel first).
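
For anyone comfortable with a little scripting, option 2 can also be done without Excel. What follows is only a minimal Python sketch, and not necessarily how the files on this page were produced. It assumes the TBX uses no XML namespace, that entries are `termEntry` elements sitting directly under `body`, and that the file is small enough to load into memory (the full IATE download is not). The function names and the sample terms are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def has_term(entry, lang):
    """True if the entry has a langSet for `lang` with at least one non-empty term."""
    for lang_set in entry.iter("langSet"):
        if lang_set.get(XML_LANG, "").lower() == lang:
            if any((term.text or "").strip() for term in lang_set.iter("term")):
                return True
    return False


def drop_incomplete_entries(root, lang_a, lang_b):
    """Remove entries missing a term in either language; return how many were removed."""
    removed = 0
    for body in list(root.iter("body")):
        for entry in list(body.findall("termEntry")):
            if not (has_term(entry, lang_a) and has_term(entry, lang_b)):
                body.remove(entry)
                removed += 1
    return removed


# Hypothetical two-entry sample: entry 1 is bilingual, entry 2 is English only.
SAMPLE = """<martif type="TBX" xml:lang="en"><text><body>
<termEntry id="1">
  <langSet xml:lang="en"><tig><term>fishing quota</term></tig></langSet>
  <langSet xml:lang="de"><tig><term>Fangquote</term></tig></langSet>
</termEntry>
<termEntry id="2">
  <langSet xml:lang="en"><tig><term>mesh size</term></tig></langSet>
</termEntry>
</body></text></martif>"""

root = ET.fromstring(SAMPLE)
removed = drop_incomplete_entries(root, "en", "de")
remaining_ids = [entry.get("id") for entry in root.iter("termEntry")]
print(removed, remaining_ids)
```

On a real file you would parse with `ET.parse(path)` and write the result back out afterwards. A file with invalid characters will need cleaning before it parses at all.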

You may also find that some entries contain invalid XML that could prevent the import of the TBX into many validating tools.  If this happens you will have to remove the offending entries with a text editor first.  Hopefully you will have an editor that reports the problem and explains where it is.  If you intend to import these files into MultiTerm they will probably still be too large as they are, so please refer to this article for instructions on how to break them down into bite-sized chunks.
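
If your editor does not report where the problem is, a short script can find the offending characters for you. The sketch below is an illustration only. It assumes the problem is a character that XML 1.0 does not allow (such as U+FFFF), which is just one way a TBX can be invalid. The function names and the sample line are hypothetical.

```python
import re

# Characters outside the XML 1.0 "Char" production: most control characters,
# surrogates, and the noncharacters U+FFFE and U+FFFF.
INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]")


def find_invalid_chars(lines):
    """Yield (line number, column, code point) for each forbidden character."""
    for line_no, line in enumerate(lines, start=1):
        for match in INVALID_XML_CHARS.finditer(line):
            yield line_no, match.start() + 1, f"U+{ord(match.group()):04X}"


def strip_invalid_chars(text):
    """Delete forbidden characters; this does not restore the intended letter."""
    return INVALID_XML_CHARS.sub("", text)


# Hypothetical sample: the second line contains a stray U+FFFF.
sample = ["<term>ok</term>\n", "<term>altera\uffffo</term>\n"]
hits = list(find_invalid_chars(sample))
cleaned = [strip_invalid_chars(line) for line in sample]
print(hits)
```

To scan a real file, pass an open file object instead of the sample list, for example `find_invalid_chars(open(path, encoding="utf-8"))`. Stripping makes the file parse again, but the affected terms should still be checked by hand because the intended character is lost.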

Update Date: 150222

Since writing this article the team responsible for IATE have created a tool you can download that allows you to extract single pairs in any combination.  It’s called IATExtract.  The pairs below were requested in the comments so I have done these, but I won’t be extracting any more from this date.  In any case the latest download from IATE will contain more terms, so you are best off taking your information from their website.

 

Available pairs (Date: 150222)

Just click on the pair you want to download and you should get a zip file containing the extracted TBX file.  If you find there is no extension on the file inside the zip just add a TBX extension to it after extracting it to a folder on your computer… I know I forgot to rename at least one so there may be more!

English <> Czech

English <> Danish

English <> Dutch

English <> French

English <> German

English <> Greek

English <> Italian

English <> Polish

English <> Portuguese

English <> Romanian

English <> Spanish

English <> Swedish

French <> Dutch

French <> Polish

French <> Portuguese

German <> Czech

German <> Danish

German <> Dutch

German <> French

German <> Slovak

Italian <> Czech

Italian <> Portuguese

Polish <> Dutch

Spanish <> French

Spanish <> Dutch

FIGS (Date: 140723)

Not a bilingual pair, but as I have it already I have also loaded a TBX containing only FR, IT, DE, ES and EN.  Might be useful for anyone looking for this combination.

FIGS + English

62 comments
  1. Bo said:

    Thanks for doing this, and for your articles. Well, I tried to get the Danish file, without much luck, so if you are able to do it I would certainly appreciate it.

    • Thanks… you didn’t mention the other language so I assumed English! It’s now there.

      • Bo said:

        Wow, that was fast. Yes, those are my languages. Just downloaded the file, and will try to import it tonight. Now I know where to look for instructions. Thanks again.

  2. Very useful, Paul. Thank you very much.
    I should also be interested in German Dutch, if you find the time for that.
    Best regards,
    AbdulRahiem

  3. Louisa Fox said:

    English German please 🙂

  4. Francisco said:

    Hi Paul, thanks a lot for this great job and all these files. Could you please add the English-Spanish pair ? Thanks in advance and keep on going. Cheers,

  5. Hi Paul! Thank you for your valuable guides. I would really appreciate it if you could extract the English-Romanian version from the IATE TBX resource.

  6. Francisco said:

    Post-scriptum: the Spanish-French pair will be very useful too, if it is not too much for you… and if you have sufficient spare time! 🙂

  7. Great stuff, Paul.

    It’d be nice to have English Italian and English Greek sometime.

    Fantastic job!

    Giles

      • Thanks a bunch, Paul.

        Buona giornata,

        Giles

        2014-07-22 9:38 GMT+02:00 multifarious :

        > paulfilkin commented: “Thanks Giles… both pairs added.” >

  8. Manicle said:

    Great Paul, thanks ! German-French & Dutch-French would be highly appreciated !

  9. Francisco said:

    Muchas gracias Paul… 🙂

  10. Philip Coucke said:

    Hello Paul,
    Thank you for all the good work.
    The combination German-Dutch and French-Dutch, as ‘Manicle’ suggested above, would also be more than highly appreciated by me… For many translators such (IT) matters are really giving headaches, so muchos gracias herefore, this would really be fantastic. Have a nice day.
    Phil

    • Thanks Philip, I added both pairs to the article. It’s quite an interesting exercise.

      • Philip Coucke said:

        Thanks to you Paul…

      • Philip Coucke said:

        Hi Paul,
        I tried to import the German-Dutch version into Multiterm.
        Is it ok to import the tbx file straight into Multiterm?
        I created a new multiterm termbase file and tried to import it into this termbase, but the import wizard finishes by saying that no terms are processed.
        Thank you for your advice
        Phil

      • Hi Philip, the way to tackle this is to convert the TBX to MultiTerm XML. So you create a new termbase in MultiTerm with the appropriate definition and then import the XML. The problem part, or rather the difficult part, is that you probably won’t be able to convert the whole TBX in one go. So if you look at the previous article I referred to it explains how I managed this for the full TBX with all 24 languages, and also one with just 5 languages. The process will be the same for you. I just loaded the TBX files for users because these are useful for anyone and not just MultiTerm users.

    • Philip Coucke said:

      Thank you Paul.
      Is there a way to import the tbx files “simply” into Studio itself as a TM? Like many translators, I’m not strong in software-related issues/conversions, so maybe there’s a trick to import them without technical pains/issues into Studio? You never know 🙂
      Many thanks,
      Phil

        Hi Philip, the main problem with converting to a TM is that a TBX is concept based, which means that each entry could have multiple terms in each language. Furthermore each entry could have, for example, 10 terms in German and 5 terms in Dutch, or 3 terms in Dutch and no terms in German, especially since this TBX is the result of an extract from a larger TBX based on 24 languages. Which ones should be matched for a TM? Technically the process of conversion is quite simple for some tools, so Xbench for example quite easily imports a TBX and then allows you to export it as a TMX. How useful this is, and what the logic is to deal with synonyms and fields, I have no idea, but the process is simple enough… in fact I just did it!
        You can download the TMX here : de(DE) – nl(BE)
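
To make the matching question concrete, here is a small Python sketch of one possible policy: pair every source term with every target term in the same concept entry. This is purely an assumption for illustration and says nothing about what Xbench actually does. The function name and the sample terms are made up.

```python
from itertools import product


def entry_to_pairs(terms_by_lang, source, target):
    """Pair every source term with every target term of one concept entry."""
    return list(product(terms_by_lang.get(source, []), terms_by_lang.get(target, [])))


# Hypothetical concept entry with two German synonyms and one Dutch term.
entry = {"de": ["Fangquote", "Fischereiquote"], "nl": ["vangstquota"]}
pairs = entry_to_pairs(entry, "de", "nl")

# An entry with no German term produces no pairs at all.
empty = entry_to_pairs({"nl": ["maaswijdte"]}, "de", "nl")
print(pairs, empty)
```

With 10 German and 5 Dutch terms this policy would produce 50 translation units from a single entry, which shows why a termbase is usually the better home for this kind of data.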

      • Philip Coucke said:

        Thanks a lot Paul! I made a separate TM and it works. If you can just contact me by ph.coucke@telenet.be, this would be very kind. Many thanks Phil

  11. Francisco A. said:

    I’d like to suggest the following lang pairs:
    English Portuguese
    French Portuguese
    Italian Portuguese
    Thanks, F.

  12. david hardisty said:

    Thank you Paul. I emailed the IATE development team yesterday and got this answer back today which I think is worth sharing:
    “We are preparing smaller files, organized by individual languages: they
    should be easier to be handled.
    After downloading, it will be possible for the users to create customized
    language pairs (using SDL), to meet their specific needs.

    We expect to have these files available on IATE by the end of August 2014.

    Best regards,

    Coordination IATE Support & Development Team
    TRANSLATION CENTRE FOR THE BODIES OF THE EUROPEAN UNION
    iate@cdt.europa.eu”

    • That’s good news David. I did expect to see this as they have only just started to share. I think their online facility is better as well and I expect to see more done with this in the near future since it contains a lot more metadata than the TBX and is more useful.
      Thanks for sharing your email.

  13. Thanks Paul! I’m currently experimenting with importing the Dutch-English TBX you posted here into Heartsome Translation Studio (the newly open source version), to see how well it preserves the data structure, and to then possibly export it back out into sth more useful (hopefully a tabbed UTF-8 text file for use as a CafeTran termbase).

    • Great, let me know how you get on. I read your other comment on FIGS so I loaded this to the article as well just in case anyone is interested in these five languages together in a single TBX.

  14. Francisco A. said:

    I am trying unsuccessfully to open the TBX files on Xbench. Any advice?

    • Probably good to ask someone from ApSIC or maybe another user of Xbench will see this post and respond. I tested a couple just now to check and all seems well so I don’t think the TBX files are the problem. I’m using Xbench 3.0.0 build 1243 if this helps?

  15. Hi Paul,

    I am trying to import your IATE FIGS tbx file into a Multiterm 2014 terminology base, and I run into problems.
    I get the following error message when I try to convert it using SDL MultiTerm Convert.

    The conversion option could not be initialised properly.
    Exception of type ‘System.OutOfMemoryException’ was thrown.

    Do you have an idea how I could overcome this difficulty?

      • Paulo Santos said:

        Hi Paul, I had this same issue with the complete IATE file, then I tried to process the two language pair files you have posted (EN-PT and FR-PT) and the problem now was corrupted characters… I used EditPad Lite to find and fix the corrupted characters but then the error turned into “not valid tbx format” in all my CAT tools. I used Glossary Converter, Studio 2014, MultiTerm Desktop 2014, MultiTerm Convert, Across, MemoQ and Xbench. We are in mid-September and IATE did not keep to their dates for publishing the language pair files. Any help would be much appreciated.

        P.S. By the way, I’ve been able to import TBX files from Microsoft language resources from EN-PT and EN-BRpt without any issues.

        Hi Paulo, the problem you will face is that even the single pair will probably be too large for most tools. So you still need to break it up into bite-sized chunks unless you are using MultiTerm Server. My recommendation to you would be to take advantage of the great work Henk Sanderson has done that I described here: IATE, the last word… maybe!
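
For readers who want to script the chunking, the following Python sketch shows the general idea: keep the header, and spread the entries over several smaller documents. It is an illustration under assumptions, not the process described in the earlier article. It assumes no XML namespace, a `martif/text/body` layout, and a file that fits in memory. The function name `split_tbx` is made up.

```python
import copy
import xml.etree.ElementTree as ET


def split_tbx(root, chunk_size):
    """Move the entries of `root` into new documents of at most `chunk_size` entries.

    Note: this empties the body of `root`, which then serves as the header skeleton.
    """
    body = root.find("./text/body")
    entries = list(body)
    for entry in entries:
        body.remove(entry)

    chunks = []
    for start in range(0, len(entries), chunk_size):
        chunk_root = copy.deepcopy(root)  # header plus an empty body
        chunk_root.find("./text/body").extend(entries[start:start + chunk_size])
        chunks.append(chunk_root)
    return chunks


# Hypothetical sample with a header and five empty entries.
SAMPLE = (
    "<martif type='TBX'><martifHeader><fileDesc/></martifHeader><text><body>"
    + "".join(f"<termEntry id='{i}'/>" for i in range(5))
    + "</body></text></martif>"
)

root = ET.fromstring(SAMPLE)
chunks = split_tbx(root, 2)
sizes = [len(chunk.find("./text/body")) for chunk in chunks]
print(sizes)
```

Each chunk can then be saved with `ET.ElementTree(chunk).write(path, encoding="utf-8", xml_declaration=True)`. Bear in mind that ElementTree does not keep the original DOCTYPE line, so a tool that insists on the DTD reference may need it added back by hand.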

      • Seconded.
        Henk’s files are better than the pretty decent ones I managed to compile from the IATE material.
        Henk is also very helpful. He responded instantly to a minor issue I had with the EN-EL tmx file.

  16. Olga said:

    Hi Paul,
    I would be very happy with a Spanish-Dutch version.

    • Olga said:

      and a Dutch-Spanish one, of course…

      • Olga said:

        meaning: ES-NL and NL-ES

    • All done. You only need one as a termbase will work in both directions.

      • Olga said:

        Thank you very much!

  17. Birgit said:

    And I would like the combinations Slovak-German and also Czech-German. Thank you, Paul

  18. John said:

    Hi Paul,
    French>Polish would be much appreciated too.

  19. Peter said:

    Hi Paul

    I would be glad, if Danish-German could be made available. TIA

      • Peter said:

        Very nice, thanks a lot.

  20. Paul,
    I downloaded the PT – EN file, but MultiTerm Convert keeps giving me this error:

    The conversion option could not be initialised properly.
    “hexadecimal value 0xFFFF, is an invalid character. Line 18277776, position 39.”

    I’m also getting an error about a missing .dtd file when trying to convert the IATE file (after extracting the language pair).

    Any help is appreciated, and keep up the good work!

    • Hi Robin, I went to that line and location in the TBX and found this:

      <term>protocolo de alteraç￿￿ã￿o</term>

      So if you delete or just correct this you’ll probably be ok… I didn’t test it. As I have pointed out in this article the TBX files leave a lot to be desired and you would be better off taking a cleaned up version from Henk. In fact funnily enough the exact problem you found is in that article!!

  21. Thanks for the reply, Paul. I did a workaround using the iate extractor and created around 15 sdltb files by the time I was finished. I plugged them into a project file and I’m testing it right now. Do you think having many separate small tb files in one project has a downside?
    Thanks again for your help!

  22. Anna said:

    Hello Paul,
    What a great help to translators! Would it be possible to have English>Russian pair?
    Thank you!
    Anna

    • Hi Anna,
      I added a note into the article as follows because you have a much better option now:

      Update Date: 150222
      Since writing this article the team responsible for IATE have created a tool you can download that allows you to extract single pairs in any combination. It’s called IATExtract. So I won’t be extracting anymore from this date. The pairs below were requested in the comments so I have done these, but I won’t be doing any more. Notwithstanding this the latest download from IATE will contain more terms anyway so you are best to take your information from their website.

      So I think the best approach is to do this yourself using the links in the article.

      Regards

      Paul

      • Anna said:

        Thank you, Paul.
        I just found out that Russian is NOT one of the 24 official EU languages. So, I can’t extract an En-Ru pair. 😦
        Best regards!

      • Of course 😉 I should have noticed that too! You could create one from the Microsoft TBX collections maybe… perhaps that would be interesting for you?
