Usually this question leads to an animated discussion. Some trainees say they use third-party tools to convert PDF files into Word documents. "It works perfectly", some say. "I tried it and it's useless - the layout becomes a mess", others argue.
The new SDL Trados Studio 2009 offers support for PDF files, but I have some mixed feelings about that. It creates the (false) impression that PDF is a format like any other. In reality, a PDF is, well, like a box of chocolates. You never know what you're gonna get.
Here are a few tips that may be helpful when you need to translate PDF documents with CAT tools.
1. Ask for the original files
Let's say that I have written a manual in Adobe FrameMaker, and I want to distribute it on my website. Only people who have FrameMaker on their computer would be able to open that manual. So in order to make this document accessible, I decide to publish it in the Portable Document Format (PDF). Now it can be opened on most computers, even across different platforms (Windows, Mac, Linux, ...). That's what PDFs are all about.
But while the distribution of PDFs is easy, extraction of translatable content from a PDF document is a much bigger challenge. So as a translator, always try to convince your customer to send you the original files as well. Professional CAT tools are much better at processing the underlying formats, such as Word, PowerPoint, FrameMaker, InDesign etc. This is by far the best workflow.
Sometimes, your customer may not have the files in the original format, though. In that case, continue reading the tips below.
2. Choose a reliable PDF converter
Acrobat Reader and Foxit Reader are free tools that let you open a PDF document. You can even save the content as a text file. But doing so puts a hard return at the end of each line, which causes incorrect segmentation in your translation editor. So you'll need a more sophisticated solution instead.
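If you are stuck with such a text file anyway, the hard returns can sometimes be removed with a short script. Below is a minimal sketch in Python, not a polished tool. It assumes that paragraphs in the saved text are separated by blank lines, which is not true for every PDF, and the function name unwrap_hard_returns is my own invention. It does not repair words that were hyphenated at the end of a line.

```python
import re


def unwrap_hard_returns(text: str) -> str:
    """Join hard-wrapped lines into paragraphs.

    Assumption: a blank line marks a real paragraph break; every other
    line break is treated as a hard return left by the PDF export.
    """
    # Normalise Windows (\r\n) and old Mac (\r) line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    cleaned = []
    # Split on one or more blank (or whitespace-only) lines.
    for paragraph in re.split(r"\n\s*\n", text):
        lines = [line.strip() for line in paragraph.split("\n") if line.strip()]
        if lines:
            # Rejoin the wrapped lines with single spaces.
            cleaned.append(" ".join(lines))
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = (
        "You can save the content\n"
        "as a text file.\n"
        "\n"
        "But each line ends\n"
        "with a hard return.\n"
    )
    print(unwrap_hard_returns(sample))
```

Always compare the result with the PDF before importing it into your CAT tool: headings, list items and table cells are easily glued together by this kind of clean-up.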
SDL Trados Studio 2009 and some other CAT tools include a third-party PDF converter. If your CAT tool doesn't support PDF, try ReadIris, Nuance PDF Converter, Solid Converter or Abbyy Transformer, or a free online service like PDFtoWord.com.
3. Manage your customer's expectations
Inform your customer about the challenges of converting PDFs. For instance, it may be possible to extract the text, but the original layout may be (partly) lost, especially when the document consists of multiple columns or text boxes. If the customer expects to receive the translated document with an identical layout, extra work may be needed. Is your customer prepared to pay extra for this?
4. Ask for a sample file before accepting the job
Suppose I had a paper document, took a picture of it, and pasted that picture into an empty Word document. Would that qualify as the Word version of my document?
Your customer's "PDF document" may have been created in a similar way. Imagine a hard-to-read fax, printed on thermal paper, that was scanned as a picture with a flatbed scanner and then saved as PDF. You may even see coffee stains or other dirt on the document. Technically speaking, it's a PDF file, but from a translation automation point of view, it's about as useless as a handwritten document. It will be virtually impossible to extract any text from such PDFs, so you may have to retype the source document before you can even start translating it.
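A quick way to judge a sample file is to run it through your converter and see how much text actually comes out of each page. The sketch below assumes you already have the extracted text of every page as a list of strings (how you get that list depends on your converter). The function name pages_without_text and the threshold of 20 characters are arbitrary choices of mine, not a standard.

```python
def pages_without_text(page_texts, min_chars=20):
    """Return the 1-based numbers of pages with little or no extracted text.

    Assumption: page_texts holds one string per page, produced by whatever
    PDF converter you use. A page that yields almost nothing is probably a
    scanned image rather than real text.
    """
    suspicious = []
    for number, text in enumerate(page_texts, start=1):
        # Ignore spaces, tabs and line breaks when counting characters.
        visible = "".join(text.split())
        if len(visible) < min_chars:
            suspicious.append(number)
    return suspicious


if __name__ == "__main__":
    pages = [
        "Chapter 1. Installing the device requires two screws.",
        "",        # nothing extracted: likely a scanned page
        "  \n ",   # only whitespace: likely a scanned page
    ]
    print(pages_without_text(pages))
```

If most pages end up in that list, the "PDF document" is probably a picture in disguise, and you should quote for retyping or OCR clean-up rather than for a normal translation job.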
5. Test the conversion - again and again
Even the best PDF converter may not succeed in extracting all text properly. Your solution may for instance work fine for English or other Western languages... but can it handle Russian, Korean or Amharic?
I tried converting a mixed English and Slovak PDF with Zamzar, and all characters with Slovak diacritics were corrupted. If you want to know whether it's your converter or the PDF itself that's causing the problem, simply try another conversion solution. PDFtoWord.com converted my Slovak text without problems.
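Spotting corrupted characters by eye is tedious in a long document, so a rough automated check can help when you compare converters. The following Python sketch counts a few typical signs of a failed conversion in the converted text. The function name conversion_report and the choice of warning signs are my own assumptions, and the default character set covers Slovak only; pass in the characters you expect for your own language.

```python
import unicodedata

# Slovak letters with diacritics; replace with the set your language needs.
SLOVAK_DIACRITICS = set("áäčďéíĺľňóôŕšťúýžÁÄČĎÉÍĹĽŇÓÔŔŠŤÚÝŽ")


def conversion_report(text, expected_chars=SLOVAK_DIACRITICS):
    """Count rough indicators of character corruption in converted text."""
    # Some converters emit a base letter plus a combining accent;
    # NFC normalisation merges them into a single character.
    text = unicodedata.normalize("NFC", text)
    return {
        # U+FFFD is inserted when a character could not be decoded.
        "replacement_chars": text.count("\ufffd"),
        # Private-use code points often point to an unmapped PDF font.
        "private_use_chars": sum(
            1 for ch in text if unicodedata.category(ch) == "Co"
        ),
        # Control characters other than line breaks and tabs.
        "control_chars": sum(
            1 for ch in text
            if unicodedata.category(ch) == "Cc" and ch not in "\n\r\t"
        ),
        # How many of the expected accented letters survived.
        "expected_chars_found": sum(1 for ch in text if ch in expected_chars),
    }


if __name__ == "__main__":
    print(conversion_report("Dobrý deň"))
    print(conversion_report("Dobr\ufffd de\ufffd"))
```

A Slovak text with zero expected characters, or with many replacement characters, is a strong hint that this converter cannot handle the file. It does not tell you whether the converter or the PDF is to blame, so you still need to try a second converter.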
But my next PDF may have been generated from a legacy DTP format on an old Mac, with text saved as pictures, and with strict security settings. You never know what you're gonna get...