multifarious

Handling PDFs… is there a best way?

We all know, I think, that translating a PDF should be the last resort.  PDF stands for Portable Document Format, and they have this name because they were intended for sharing with users on any platform, irrespective of whether those users owned the software used to create the original file.  They were shared so they could be read.  They were not intended to be editable; in fact the format is also used to make sure that the version you are reading can’t be edited.  So how did we go from this original idea to so many translators having to find ways to translate them?

I think there are probably a couple or three reasons for this.  First, the PDF might have been created using a piece of software that is not supported by the available translation tool technology and that has no export/import capability.  Second, some clients can be very cautious (that’s the best word I can find for this!) about sharing the original file, especially when it contains confidential information.  So perhaps they mistakenly believe the translator will be able to handle the PDF without compromising the confidentiality, or perhaps they have been told that only the PDF can be shared and they lack the paygrade to make any other decision.  A third reason is that the client may not be able to get their hands on the original file used to create the PDF.

Whatever the reason, handling the PDF is tricky for a number of reasons:

  1. It might be a scanned image, in which case you need to OCR the file first to have any chance of getting at the text.  The success of this will vary considerably with the quality of the PDF and with how easy it is to get at the text, for example where it sits over images or even coloured backgrounds.
  2. The conversion of the PDF to allow you to translate it might be a text only extraction in which case you might have to extensively DTP the file afterwards to provide a formatted document.
  3. The conversion of the PDF might be an attempt at creating a formatted text & image extraction, probably in Word format, and the extent of DTP afterwards will range from nothing to a serious amount of work depending on the type of content and the quality of the PDF.
  4. And then the final format of the file.  What is the client expecting?  If they provide a PDF and expect InDesign files back then you have more work after the translation because you are probably going to end up with a Word file at first.  There are tools to help with this but it’s still more work afterwards.

There is probably no way around that last point without a lot of work, so I’m going to concentrate on how to return a PDF.  I know that sometimes you may have a client who actually wants a Word file back because they really did lose the source, and Studio is excellent for this because you’ll have a source and target version in Word when you’re done, but here I’m going to focus on returning a PDF and how to get the best quality finish.

Now, some translation tools will handle a PDF for translation, as we know.  Some can even do a rudimentary OCR, some do a very good OCR.  Some handle the PDF as a text file only and some will make an effort to reproduce the formatting of the PDF by converting to DOCX.  But as far as I know, none will allow you to recreate the PDF so that the formatting is as good as it is in the PDF itself.  So is there a best way?  Probably not one best way, as it will depend on the file you have been given, but I’m going to share the one I like the best so far as it has bailed me out of many tricky PDF related problems in the last year or so.

InFix PDF Editor

InFix is a PDF editor developed by Iceni Technologies, and basically it’s a tool that allows you to edit the text in a PDF… sort of an Adobe substitute you might think.  But in actual fact it’s a little more than that, because it has this menu which gives it away as a tool that could be very handy for translators:

Actually this menu is from Version 7, and the XLIFF approach may have resulted from the valuable lessons they learned in working with a few people in our business.  The difference is that the Local menu item at the bottom is from Version 6… ish, and this allowed you to export the extracted text to an XML or plain text (with markup) format.  They even provided some filters for use with “popular CAT tools”, although sadly they haven’t realised that Trados is completely redundant and hasn’t been sold for around 7 years, but they still provide an ini file!  I’d be happy to provide a suggested sdlftsettings file for Studio if anyone needs it!  (Post publication addition: After being asked in the comments I put an sdlftsettings file for the txt and the xml exports here.)  The other items at the top are all Version 7 and this is far more interesting and reliable.  This version extracts the translatable text from the PDF and exports it as an XLIFF.

Now, the reason the bottom item is called Local is because the InFix application does all the work on your computer.  The XLIFF parts however are all done in the cloud using their TransPDF website.  This is quite impressive and you can use this without the InFix PDF Editor at all.  The idea is you upload a PDF, select the language pair you want, download the XLIFF, translate it in any translation tool you like, upload the translated XLIFF and the cloud miraculously returns the now translated PDF ready for further editing or handing over to your client as it is.  There is a cost associated with this and at the time of writing you get 50 PDF pages for free and then pay 50 cents a page thereafter.  So if you don’t get a lot of PDFs that need translating this could be exactly the tool you’ve been looking for.  You pay as you need it and build the cost into the price for your client… couldn’t be easier!
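Since the round trip depends on uploading a fully translated XLIFF, a quick sanity check before the upload step can save a wasted cycle.  The snippet below is only a rough sketch of that idea and not part of InFix or TransPDF: it assumes the exported file follows the XLIFF 1.2 layout (trans-unit elements containing source and target), which I have not confirmed for their export, and the function name and sample content are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the export uses the XLIFF 1.2 namespace.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hypothetical sample, standing in for a file downloaded from the cloud service.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="demo.pdf" source-language="en" target-language="de" datatype="plaintext">
    <body>
      <trans-unit id="1"><source>Hello</source><target>Hallo</target></trans-unit>
      <trans-unit id="2"><source>World</source><target></target></trans-unit>
      <trans-unit id="3"><source>Page 1</source></trans-unit>
    </body>
  </file>
</xliff>"""


def untranslated_units(xliff_text):
    """Return the ids of trans-units whose <target> is missing or empty."""
    root = ET.fromstring(xliff_text)
    missing = []
    for unit in root.iterfind(".//x:trans-unit", XLIFF_NS):
        # Units explicitly marked as not translatable are skipped.
        if unit.get("translate") == "no":
            continue
        target = unit.find("x:target", XLIFF_NS)
        # itertext() also collects text nested inside inline tags.
        text = "".join(target.itertext()).strip() if target is not None else ""
        if not text:
            missing.append(unit.get("id", "?"))
    return missing


if __name__ == "__main__":
    print(untranslated_units(SAMPLE))
```

Most translation tools will flag empty segments for you anyway, so treat this as a belt-and-braces check rather than a replacement for the verification in your own tool.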

It’s also worth mentioning the cloud solution a bit more.  When you sign up you get your own account, which keeps track of the projects you might be working on and also provides a flight check guide to anything you need to address, such as font changes where a different font would be better to represent the characters in the target version.  You can use this dashboard independently of the InFix Editor, but if you do have InFix then the process is quite well integrated, allowing you to work only from the desktop tool and connect to the cloud when needed.

If you do get a lot of PDF files then I’d recommend you purchase the InFix PDF Editor.  This is really a wonderful tool even without the translation options.  You can almost treat a PDF as if it was just a Word file, or a Publisher file.  Not nearly as flexible of course, but amazingly good.  On price, well this is another thing that’s changed with Version 7: it’s now a subscription service and has some very good value options:

If you take any of these then the TransPDF feature is free of charge, you just use it whenever you like.  So if you do more than 120 pages a year then the annual license pays for itself easily.  If you have a 12 page document to do then even the monthly license is worth it.  If you have any editing at all to do in the PDF afterwards to try and get a more polished translated version for your client then you won’t need to buy another PDF editor, you just use this.
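If you want to run the numbers for your own workload, the comparison is simple enough to sketch.  The function below uses the pay-per-page figures quoted earlier (50 free pages, then 50 cents a page); the subscription price is deliberately left as a parameter you fill in from the current price list, and whether the free allowance is one-off or recurring is something I’m treating as an assumption you can switch off.  The function names are my own.

```python
def pay_per_page_cost(pages, free_pages=50, cents_per_page=50):
    """Cost in cents of translating `pages` PDF pages without a subscription."""
    billable = max(0, pages - free_pages)
    return billable * cents_per_page


def break_even_pages(subscription_cents, free_pages=50, cents_per_page=50):
    """Smallest page count at which pay-per-page costs more than the subscription."""
    # Paying for n billable pages exceeds the subscription once
    # n * cents_per_page > subscription_cents.
    billable_needed = subscription_cents // cents_per_page + 1
    return free_pages + billable_needed


if __name__ == "__main__":
    # 6000 cents is a purely hypothetical subscription price for illustration.
    # free_pages=0 models a user who has already used up the free allowance.
    print(break_even_pages(6000, free_pages=0))
    print(pay_per_page_cost(200, free_pages=0))
```

Working in whole cents avoids any rounding surprises, and setting free_pages to 0 is probably the fairer comparison once you are past your first few jobs.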

Normally I would not go on about translating PDFs or software to help you with it, but this tool is really worth a look.  To make it easier to follow I’ve created a video with a PDF file I took from the internet (cut down to 3 pages for this demo), deliberately chosen so it’s not too easy, but also not too hard.  I did take a look at what you get with various translation tools that can handle PDFs according to their documentation… also quite enlightening, but I’m not going to discuss that in here!  That exercise did reinforce my opinion that Studio does have the best PDF converter built in.  It’s not always good for all the reasons already discussed but as you’ll see it provides an excellent attempt with this example file.  Have a look for yourselves and test it in your tools if you don’t believe me!

Video duration approx. 17 mins

That was it… if anyone asks me what’s the best way to handle a PDF my initial answer is still the same… get the original source file.  But at least now I have a pretty good second choice before resorting to the translation tools themselves.

A final word would be the potential for improvements.  I would love to see Iceni use the Studio API to create a new view that did the following:

That would be a very nice enhancement for project managers and translators dealing with large numbers of PDF files and probably not difficult to do from the Studio side.  Maybe for Xmas 😉