Handling PDFs… is there a best way?

We all know, I think, that translating a PDF should be the last resort.  PDF stands for Portable Document Format, and the name reflects the intent: files that can be shared with users on any platform, whether or not they own the software used to create the original.  They were meant to be shared so they could be read.  They were not intended to be editable; in fact the format is also used to make sure that the version you are reading can’t be edited.  So how did we go from this original idea to so many translators having to find ways to translate them?

I think there are probably a couple or three reasons for this.  First, the PDF might have been created using a piece of software that is not supported by the available translation tool technology and with no export/import capability.  Secondly, some clients can be very cautious (that’s the best word I can find for this!) about sharing the original file, especially when it contains confidential information.  So perhaps they mistakenly believe the translator will be able to handle the file without compromising the confidentiality, or perhaps they have been told that only the PDF can be shared and they lack the paygrade to make any other decision.  A third reason is the client may not be able to get their hands on the original file used to create the PDF.

Whatever the reason, handling the PDF is tricky for a number of reasons:

  1. It might be a scanned image, in which case you need to OCR the file first to have any chance of getting at the text.  The success of this will vary considerably with the quality of the PDF and with how easy it is to get at the text, for example where it sits over images or even coloured backgrounds.
  2. The conversion of the PDF to allow you to translate it might be a text only extraction in which case you might have to extensively DTP the file afterwards to provide a formatted document.
  3. The conversion of the PDF might be an attempt at creating a formatted text & image extraction, probably in Word format, and the extent of DTP afterwards will range from nothing to a serious amount of work depending on the type of content and the quality of the PDF.
  4. And then the final format of the file.  What is the client expecting?  If they provide a PDF and expect InDesign files back then you have more work after the translation because you are probably going to end up with a Word file at first.  There are tools to help with this but it’s still more work afterwards.

The last point there is probably no way around without a lot of work so I’m going to concentrate on how to return a PDF.  I know that sometimes you may have a client who actually wants a Word file back because they really did lose the source, and Studio is excellent for this because you’ll have a source and target version in Word when you’re done, but I’m going to concentrate on returning a PDF and how to get the best quality finish.  Now, some translation tools will handle a PDF for translation as we know.  Some can even do a rudimentary OCR, some do a very good OCR.  Some handle the PDF as a text file only and some will make an effort to reproduce the formatting of the PDF by converting to DOCX.  But as far as I know, none will allow you to recreate the PDF so that the formatting is as good as it is in the PDF itself.  So is there a best way?  Probably not one best way as it will depend on the file you have been given, but I’m going to share the one I like the best so far as it has bailed me out of many tricky PDF related problems in the last year or so.

InFix PDF Editor

InFix is a PDF editor developed by Iceni Technologies, and basically it’s a tool that allows you to edit the text in a PDF… sort of an Adobe substitute you might think.  But in actual fact it’s a little more than that, because it has this menu that gives it away as a tool that could be very handy for translators:

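[Screenshot: the translation menu in InFix, with the XLIFF items at the top and the Local item at the bottom]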

Actually this menu is from Version 7, and the XLIFF approach may have resulted from the valuable lessons they learned in working with a few people in our business.  The Local menu item at the bottom dates from Version 6… ish, and it allowed you to export the extracted text to an XML or plain text (with markup) format.  They even provided some filters for use with “popular CAT tools”, although sadly they haven’t realised that Trados is completely redundant and hasn’t been sold for around 7 years, yet they still provide an ini file!  I’d be happy to provide a suggested sdlftsettings file for Studio if anyone needs it!  (Post publication addition: After being asked in the comments I put an sdlftsettings file for the txt and the xml exports here) The other items at the top are all new in Version 7, and this is far more interesting and reliable.  This version extracts the translatable text from the PDF and exports it as an XLIFF.

Now, the reason the bottom item is called Local is because the InFix application does all the work on your computer.  The XLIFF parts however are all done in the cloud using their TransPDF website.  This is quite impressive and you can use this without the InFix PDF Editor at all.  The idea is you upload a PDF, select the language pair you want, download the XLIFF, translate it in any translation tool you like, upload the translated XLIFF and the cloud miraculously returns the now translated PDF ready for further editing or handing over to your client as it is.  There is a cost associated with this and at the time of writing you get 50 PDF pages for free and then pay 50 cents a page thereafter.  So if you don’t get a lot of PDFs that need translating this could be exactly the tool you’ve been looking for.  You pay as you need it and build the cost into the price for your client… couldn’t be easier!
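As a rough illustration of the round trip, here is a minimal sketch of a sanity check you could run on the translated XLIFF before uploading it back.  It reports the language pair recorded in the file and lists any units that still have no target text.  It assumes the file is XLIFF 1.2, which is not confirmed anywhere in this article for TransPDF, and the function name summarise_xliff and the sample content are made up purely for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: the export is XLIFF 1.2. The article does not state which
# XLIFF version TransPDF produces, so adjust the namespace if it differs.
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
NS = {"x": XLIFF_NS}


def summarise_xliff(xml_data):
    """Return the language pairs, the unit count and the ids of empty targets.

    xml_data may be a str or the raw bytes read from an .xlf file.
    """
    root = ET.fromstring(xml_data)
    languages = []
    total_units = 0
    untranslated = []
    for file_el in root.findall("x:file", NS):
        # Each <file> element records its own source and target language.
        languages.append(
            (file_el.get("source-language"), file_el.get("target-language"))
        )
        for unit in file_el.iter(f"{{{XLIFF_NS}}}trans-unit"):
            total_units += 1
            target = unit.find("x:target", NS)
            # A missing <target>, or one with only whitespace, counts as untranslated.
            text = "".join(target.itertext()).strip() if target is not None else ""
            if not text:
                untranslated.append(unit.get("id"))
    return {"languages": languages, "units": total_units, "untranslated": untranslated}


# Hypothetical sample content, not a real TransPDF export.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="demo.pdf" source-language="en-GB" target-language="fr-FR" datatype="plaintext">
    <body>
      <trans-unit id="1"><source>Hello</source><target>Bonjour</target></trans-unit>
      <trans-unit id="2"><source>World</source><target></target></trans-unit>
      <trans-unit id="3"><source>Again</source></trans-unit>
    </body>
  </file>
</xliff>"""

report = summarise_xliff(SAMPLE)
print(report)
```

For a real file you would pass in the bytes read from disk instead of SAMPLE.  If the language pair it reports is not the one you selected, or the untranslated list is not empty, that is worth sorting out before the upload.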

It’s worth saying a bit more about the cloud solution.  When you sign up you get your own account, which keeps track of the projects you might be working on and also provides a flight check guide to anything you need to address, for example font changes where a different font would better represent the characters in the target version.  You can use this dashboard independently of the InFix Editor, but if you do have InFix then the process is quite well integrated, allowing you to work only from the desktop tool and connect to the cloud when needed.


If you do get a lot of PDF files then I’d recommend you purchase the InFix PDF Editor.  This is really a wonderful tool even without the translation options.  You can almost treat a PDF as if it was just a Word file, or a Publisher file.  Not nearly as flexible of course but amazingly good.  On price, well this is another thing that’s changed with Version 7: it’s now a subscription service and has some very good value options:

  • £5.99 per month, renewed month to month
  • £59.99 per annum for a single user
  • £1,199 per annum for up to 1000 users

If you take any of these then the TransPDF feature is free of charge, you just use it whenever you like.  So if you do more than 120 pages a year then the annual license pays for itself easily.  If you have a 12 page document to do then even the monthly license is worth it.  If you have any editing at all to do in the PDF afterwards to try and get a more polished translated version for your client then you won’t need to buy another PDF editor, you just use this.
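The break-even figures above are simple arithmetic, and a small sketch makes the working explicit.  It assumes, as the paragraph does, that the per-page price and the subscription price can be compared number for number (so it ignores the difference between cents and pence), and it ignores the 50 free pages.  The function name pages_to_break_even is just for illustration.

```python
import math


def pages_to_break_even(subscription_price, price_per_page):
    """Smallest whole number of pages whose pay-per-page cost reaches the subscription price."""
    return math.ceil(subscription_price / price_per_page)


# Prices as quoted in the article at the time of writing.
annual_pages = pages_to_break_even(59.99, 0.50)   # annual single-user licence
monthly_pages = pages_to_break_even(5.99, 0.50)   # month-to-month licence

print(annual_pages, monthly_pages)  # 120 12
```

If the prices change, or you want to allow for the free pages, only the two inputs need adjusting.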

Normally I would not go on about translating PDFs or software to help you with it, but this tool is really worth a look.  To make it easier to follow I’ve created a video with a PDF file I took from the internet (cut down to 3 pages for this demo), deliberately chosen so it’s not too easy, but also not too hard.  I did take a look at what you get with various translation tools that can handle PDFs according to their documentation… also quite enlightening, but I’m not going to discuss that in here!  That exercise did reinforce my opinion that Studio does have the best PDF converter built in.  It’s not always good for all the reasons already discussed but as you’ll see it provides an excellent attempt with this example file.  Have a look for yourselves and test it in your tools if you don’t believe me!

https://www.youtube.com/watch?v=nI77wkdxU1g

Video duration approx. 17 mins

That was it… if anyone asks me what’s the best way to handle a PDF my initial answer is still the same… get the original source file.  But at least now I have a pretty good second choice before resorting to the translation tools themselves.

A final word would be the potential for improvements.  I would love to see Iceni use the Studio API to create a new view that did the following:

  • Drag and drop your project PDF files into the new View
  • Bulk export the XLIFFs for all the files and create a Studio project
  • Once the Project was complete run a new Batch Task that exported the translated XLIFFs to a location where they can be imported back into the PDFs
  • Download the translated PDFs for final edit and review
  • Maybe include a similar view to TransPDF inside this Studio view to complete the picture.
  • … and one more added after the original article was posted.  Support for BiDi languages (Arabic, Hebrew etc.)

That would be a very nice enhancement for project managers and translators dealing with large numbers of PDF files and probably not difficult to do from the Studio side.  Maybe for Xmas 😉

30 thoughts on “Handling PDFs… is there a best way?”

  1. Great article as usual Paul. If you remember during my PDF Syndrome session in Warsaw we talked about it, but I guess it took Iceni quite some time to come up with this version. I wonder if it supports importing Arabic XLIFF files back to PDF format, knowing that RTL languages are very hard to deal with! Any ideas? Have you tried it with Arabic or even Hebrew?

    1. Hi Sameh, thanks for this question. I meant to mention this and forgot… my very bad!! I asked the guys at Iceni earlier in the week if they had plans for BiDi support as they don’t offer any BiDi languages in their conversion process. The answer was “No immediate plans for BiDi at the moment. However, we get asked about it fairly often so pressure is growing on us to do something about it.” I take this as a positive response, and given the work Iceni have already done to bring this tool closer to translators’ requirements I think it’s hard to ignore such widely spoken and strategically important languages if you want to grow. Maybe you should contact them and make your case?

  2. Great article as usual Paul. I tried to comment a while ago but I think it did not go through so I will repeat it again. Have you tried this with Arabic? Or any other RTL language? I guess it would really not work well. I have to try myself, but with a complicated PDF like the one you have used in the Demo it will still be a mess because of the fact that you need to flip the layout to RTL! I just have to try myself my friend and will report back to you should it really work.
    Sam

    1. Hi Sam, it did go through but you raised some interesting problems associated with BiDi so I approved this one too. The DTP work involved in a file like the one I used would be considerable, so a far more costly exercise. But the same would apply in any format I guess, the difference being it’s harder to edit a PDF.

  3. Dear Paul, could you please attach here the aforesaid sdlftsettings for export to SDL Trados 2015 ?

    1. Hello Jack, I updated the article and added a link to a zip file containing two settings files for Studio 2015. One for the txt export and one for the xml export. But really, you’ll find the results using the XLIFF export via TransPDF are far superior to these. Neither are perfect and either contain a few things you can’t easily put together because of tagging between letters, or contain lists of fonts that are hard to hide without creating huge lists of every possible font to capture all instances. The results are not too bad and easily workable, but the XLIFF process is much improved. In fact the reason I didn’t write about this tool until now is because of the missing XLIFF. I recall showing InFix to a translator at the ATA a few years ago and even then it resolved a very difficult problem she was having with poor quality PDF files, but if she’s still using it I reckon she loves the XLIFF process!

      1. Thanks for the files. I tried them out. You were right – XLIFF gives a better result, but I don’t like sending my docs to some outside web page. It would be better to do the same job within the program.

  4. will certainly contact them Paul, this is very important and Arabic is spoken by so many people, so they really should consider investing in this.

  5. Dear Paul,

    Your Multifarious came right on time !

    I am currently wrestling with a big PDF file that I am trying to convert to WORD for translation in Studio 2014, but without success. I would greatly appreciate any help/advice you could provide.

    – PDF file: Word count ~ 5,500 with many pictures, drawings
    – Language pair: Vietnamese-into-English
    – My PM could not get a WORD file for me because the client … cannot get their hands on one ?
    – This project contains 3 PDF files. The other 2 PDF files (smaller, about 1,800 words each) were successfully converted to Word files by my PM. I don’t know what program she used, but she could not convert this big one though.

    – What I did: I passed this PDF file through Studio. It did generate a beautiful Word file. However, this Word file cannot be opened back in Studio (it was blank when opened). It turned out that this Word file is only a picture (OCR result => mirror image/picture), not a text. So it is not translatable.

    Would you recommend that I use InFix and/or TransPDF ? Or any other idea ? I think perhaps a XLIFF file might work in this case.

    Thank you very much in advance for your time reading this.

    Best ,

    Mai Tran Vietnamese-English-Vietnamese Freelance Translator

    1. Hi Mai, InFix does have an OCR capability but all I can recommend is you try it. There is a free trial so you won’t lose anything to test it out. I think the success or failure of OCR’ing the text will be dependent on how embedded the text is into the image and whether or not the text can be recognised. Give it a try!

    2. Hi there,
      I would recommend you try some specialized OCR tools like this one https://finereaderonline.com (Files up to 10 pages for free). They also offer MT of the text from 35 languages (Vietnamese included).
      Good luck.

      1. Thanks for the tip Jack. Out of interest I tested the same file in this article. First converted to docx… did a better job than Studio I think (as expected) but did lose a lot of the formatting where Studio did not. Overall seemed better for clarity of text. Then I tried the MT to French and asked it to retain the pdf output. It doesn’t, it converted to docx with the same quality finish as the straight conversion.
        So doesn’t come close to the InFix solution in my opinion, but perhaps it’ll be good for OCR as you mention… this I have not compared.

      2. Dear Paul, the reason why you didn’t get the translated pdf on output is because your initial file was already recognized, IMHO. Try a PDF without recognized text as in Mai’s case (a set of scanned pages, for example). A scanned pdf will then be converted to an OCR’ed (and translated if needed) PDF. Then you’ll see the difference. The quality of recognition depends of course on the complexity of the page design. To convert docx to pdf you don’t need any extra tool, even Word can do it.
        But in terms of editing recognized PDF, here I fully agree with you, Infix is unbeaten. 😉

        1. Hello Jack, I took your advice and tried again. I created an image based version of the same file I used in the article and ran it through ABBYY, Studio and InFix. Very interesting results! ABBYY took a long time and failed to recognise anything on the last page, so I redid that page and this time it says “page3_MiniCooper_forocr.pdf was not processed: the recognized document contains errors. Your account has not been charged”. I also couldn’t use MT for this because ABBYY insisted it knew, incorrectly, what language was being scanned and said it didn’t have MT for that language (German). InFix did a reasonable job on the last page but a really bad one on the first two, and Studio completely surprised me and seems to have done a better job than either of them on my test file. It identified and retained the images in a better way than the original Studio test did in the article, and made a good job of the text throughout the three pages. Still needed a lot of work but I am very surprised to see Studio deal with the file so well for an OCR.

      3. Very interesting results, Paul. I was surprised that Studio has a built-in OCR engine. Why then didn’t it work for Mai?! I tried the same way you did with the abbyy site. In my case it went well. Here is my recognized pdf version (https://www.dropbox.com/s/swa34thkos5xaoe/MiniCooper-JPGS-FR.pdf?dl=0) and here is the docx translated by the same site (https://www.dropbox.com/s/snt1vwztc4xo9sk/MiniCooper-JPGS-FR_en-fr.docx?dl=0). Unfortunately I couldn’t get the version translated by Studio the first time, because of some strange error (not closed end tag #xxxxx). I encounter this error rather often. Apropos, Paul, what workaround would you advise for this error?
        Luckily, I tried a second time and it went without errors. (Hmmm… ?!) So here is my file recognized and translated by Studio.(https://www.dropbox.com/s/2weu89vej6wr501/MiniCooper_JPGS-Studio.docx?dl=0). Now you can compare two Word documents. In my opinion both tools did good recognition job. But in some places abbyy was slightly better (more recognized text and correctly recognized pictures where Studio sliced them).

        1. My guess is she didn’t activate the option to OCR all characters which is normally needed to OCR the whole file. This option is in the filetype settings for OCR… my bad for not mentioning this originally, although I’d still give InFix a shot at this first I think despite my initial test results which I put here if you’re interested.

  6. Hi Paul,

    This is so good… But suddenly when I want to save the translation (in Studio) as an XLIFF file, I can only select the Word docx format! Have you any idea why this happens?

    Mats

    1. Hi Mats, it sounds to me as though you must have added the PDF to Studio and not the XLIFF. There is absolutely no way it’s possible for an XLIFF source to be converted into a DOCX target file! Was it formatted? If it was there’s your answer without a doubt!!

  7. Basic text editing in Adobe Acrobat (which I guess most translators have anyway) is not difficult, so it would be interesting to know what one can do in Infix which is not possible (or very difficult) in Acrobat. I believe the editing of the contoured text might be impossible in Acrobat, but could you tell me what other differences there are? The way I understand it is that if you can manage the necessary editing in Acrobat you don’t have to use Infix at all — right? (Although the catch is, of course, that you probably don’t know beforehand.)

    1. Hi Mats, the point here is that using InFix, or the online capability, allows you to export the text for translation and then import it back. So you end up with a translated PDF formatted as the original. I guess if you used the online export/import then you could use Adobe Acrobat for any editing, but that would be down to personal choice. I think if you have used the online feature then you may as well pay for infix unless it’s a one off and you already own Adobe. Infix is also half the price and you get the export/import features thrown in.

  8. Hi Paul. Thank you for another great post. I need some help as I have to deliver two translated PDF files tomorrow morning. I have tried to implement the above-mentioned ini files in Trados Studio 2017, which did not solve my problem. Trados will still not accept the .xlf file from TransPDF, as I receive an error message whenever I try to create a translation of a single document: “The file xx.xlf cannot be opened as it contains languages that are not configured for this translation”. Can you help me?

    1. Hi, I think you should post this into here so we can help you more easily – http://community.sdl.com/appsupport
      It sounds as though the language codes in the XLF are not matching your project which is odd, so I’d start by manually editing with a decent text editor and make sure that this is correct. Hard to know for sure without seeing your project files and the XLF. But please, try the community.

      1. Hi Paul. Thank you very much for your reply. I found out that the original language for the PDF was in fact German, and in addition the original file type was an InDesign file. So no wonder I was not able to make it work in Trados at all. I ended up converting the file to a Word document and translating it in Trados. As it looked absolutely appalling I chose to type the Danish translation directly into a Danish copy of the Word document….
        I will try to post my question in the community as you suggested. Hopefully I will be able to find out what went wrong :0).
