I thought Studio could handle a PDF?

Update: Studio 2015 does have a built-in OCR facility for PDF, so whilst this article is still useful, keep that in mind!  It is also worth reviewing the solution from InFix using XLIFF.
Studio has a PDF filetype, and it can do a great job of translating PDF files… BUT… not all PDF files!
So what exactly do I mean by this?  Surely a PDF is a PDF?  Well, this is true, but not all PDF files have been created in the same way, and this is an important point.  PDF stands for Portable Document Format and was originally developed by Adobe some 20 years ago.  Today it’s even a recognised standard, and for anyone interested you can find the published standards here… at least the ones I could find:

Despite this recognition, the strength of the PDF is unfortunately also its weakness when it comes to translation, because for translation we need the text.  Any document format can become a PDF and there are many ways to get there.  So for example you can create a file in MS Word and save it as a PDF.  If you do this then Studio will normally be able to extract the text from this PDF and present it for translation.  But if the PDF was created from an image then no text will be extracted at all.  This is because there is no textual information in an image, so you have to use OCR (Optical Character Recognition)… exactly the same as if you had to translate text from images inserted into a Word file.
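If you want a rough idea of which kind of PDF you have before opening it in Studio, the difference between a text-based and an image-based PDF can be sketched in a few lines of Python.  This is only a heuristic under my own assumptions, not how Studio itself decides: it decompresses the streams in the file and looks for the PDF text-showing operators (`Tj` and `TJ`) inside a `BT … ET` text block.  The function names are made up for illustration, and it will not cope with encrypted files or unusual stream filters.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of a PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

# A text block (BT ... ET) that shows a string or array with Tj / TJ.
TEXT_OP_RE = re.compile(rb"\bBT\b.*?[)>\]]\s*T[jJ]\b.*?\bET\b", re.DOTALL)


def iter_streams(pdf_bytes):
    """Yield each stream, Flate-decompressed where that is possible."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the trailing newline before "endstream"
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not Flate data (e.g. uncompressed content or a JPEG): use as-is
            yield raw


def has_text_layer(pdf_bytes):
    """Heuristic: True if any stream appears to draw text on the page."""
    return any(TEXT_OP_RE.search(stream) for stream in iter_streams(pdf_bytes))


def pdf_file_has_text(path):
    """Convenience wrapper that reads a PDF from disk."""
    with open(path, "rb") as handle:
        return has_text_layer(handle.read())
```

A result of `False` suggests the file is probably image based and will need OCR first.  A result of `True` only tells you that some text is present; a scanned PDF with a hidden OCR layer would also pass this check, so the quality of that text is a separate question.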
If you receive a PDF that was created from images then you will probably see something like this when you attempt to open the PDF in Studio:
Image based PDF
This is because Studio was unable to find any text, so you only see the start and end of file markers in orange.  So what’s the solution?  You have to OCR the PDF and extract the text.  There are plenty of OCR solutions out there, and many users tell me the best of them is ABBYY FineReader.  This application can extract the text and even make a decent job of retaining the formatting.  But if formatting is something your client is prepared to tackle, because it may not be trivial, or you have another DTP application you intend to use for this, then you might find an accurate and free application like FreeOCR useful.  It uses the Tesseract OCR engine, originally developed at HP Labs and now improved and sponsored by Google.  The interface looks like this, with the pane on the left showing the image-based PDF I opened and the pane on the right showing the extracted text:
FreeOCR
So if I take a PDF that is prepared on a textual basis, and the text extracted from the image based PDF using FreeOCR, and open them both in Studio I see this:
Difference between extracts
You can see the PDF on the left retains all its nice formatting, and things like hyperlinks and other tags would probably be retained too, whilst the one on the right is plain text.  It’s not perfect, but you could easily go through the text and tidy it up before opening it in Studio… certainly the text extraction isn’t too bad.  The application also comes with built-in language packs to improve the character recognition for EN, DA, DE, ES, FI, FR, IT, NL, NO, PL and SE.  Even when you can open the PDF directly in Studio the final target file will be a DOCX anyway, so this might be a good solution for many.
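The tidying up can be partly automated.  Here is a minimal sketch in Python, assuming the OCR output has been saved as plain text with hard line breaks at the end of every scanned line and blank lines between paragraphs.  The function name is hypothetical, and the rules are deliberately simple: rejoin words hyphenated across a line break, fold each paragraph onto one line, and collapse repeated spaces.

```python
import re


def tidy_ocr_text(raw):
    """Clean up plain-text OCR output before sending it for translation."""
    # Normalise Windows / old Mac line endings to "\n"
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Rejoin words split across a line break: "trans-\nlation" -> "translation"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Blank lines (possibly containing spaces) separate paragraphs
    paragraphs = re.split(r"\n\s*\n", text)

    cleaned = []
    for paragraph in paragraphs:
        # Fold remaining line breaks and runs of spaces into single spaces
        flat = " ".join(paragraph.split())
        if flat:
            cleaned.append(flat)

    return "\n\n".join(cleaned)


sample = "Studio has a PDF file-\ntype, and it can do a\ngreat job.\n\n\nBut not   all PDF files!\n"
print(tidy_ocr_text(sample))
```

The point of folding each paragraph onto one line is that hard line breaks left over from the scan are likely to split sentences across segments.  Be aware that the hyphen rule will also join genuine compounds that happened to break at the hyphen (“built-” / “in” becomes “builtin”), so the result still needs a read through.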
FreeOCR can also OCR directly from a scanner, so if you are only sent hard copies for translation and you have scanning capability then you can use FreeOCR for this too.  Certainly worth a look if you need to resolve a problem like this and you don’t have any other software already.
A final word on this: I think PDF is a last resort, even if Studio does have a PDF filetype.  It is always more appropriate to work from the original source file and avoid all of this additional work.
