9 July 2009

Tips for translating PDF files

In almost every Trados training I give, one of my trainees will at some point ask me: "And what about PDFs?" Indeed, it's fine to learn how to translate word processor documents or web pages with Trados or other CAT tools, but most translators receive some (or all) of their source documents in PDF format.

Usually this question leads to an animated discussion. Some trainees say they use third-party tools to convert PDF files into Word documents. "It works perfectly", some say. "I tried it and it's useless - the layout becomes a mess", others argue.

The new SDL Trados Studio 2009 offers support for PDF files, but I have some mixed feelings about that. It creates the (false) impression that PDF is a format like any other. In reality, a PDF is, well, like a box of chocolates. You never know what you're gonna get.

Here are a few tips that may be helpful when you need to translate PDF documents with CAT tools.

1. Ask for the original files

Let's say that I have written a manual in Adobe FrameMaker, and I want to distribute it on my website. Only people who have FrameMaker on their computer would be able to open that manual. So in order to make this document accessible, I decide to publish it in the Portable Document Format (PDF). Now it can be opened on most computers, even across different platforms (Windows, Mac, Linux, ...). That's what PDFs are all about.

But while the distribution of PDFs is easy, extraction of translatable content from a PDF document is a much bigger challenge. So as a translator, always try to convince your customer to send you the original files as well. Professional CAT tools are much better at processing the underlying formats, such as Word, PowerPoint, FrameMaker, InDesign etc. This is by far the best workflow.

Sometimes, your customer may not have the files in the original format, though. In that case, continue reading the tips below.

2. Choose a reliable PDF converter

Acrobat Reader and Foxit Reader are free tools that let you open a PDF document. You can even save the content as a text file. But if you do, a hard return appears at the end of each line, which causes incorrect segmentation in your translation editor. So you'll need a more sophisticated solution instead.
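For readers comfortable with a little scripting, the hard-return problem can be reduced before the text goes into a CAT tool. The following is only a minimal sketch in Python, under the assumption that paragraphs in the exported text file are separated by blank lines (many PDF exports don't guarantee this). The function name `unwrap_hard_returns` is made up for this illustration, and words hyphenated across line ends are not repaired.

```python
import re


def unwrap_hard_returns(text: str) -> str:
    """Join hard-wrapped lines into paragraphs.

    Assumption: a blank line marks a real paragraph break; every other
    line break is a hard return added when the PDF was saved as text.
    """
    # Split the text into paragraphs at one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = []
    for paragraph in paragraphs:
        # Drop empty lines and surrounding whitespace, then join with spaces.
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        joined.append(" ".join(lines))
    return "\n\n".join(joined)


sample = (
    "Acrobat Reader lets you save the\n"
    "content as a text file.\n"
    "\n"
    "Each line ends with a hard return."
)
print(unwrap_hard_returns(sample))
```

With the line breaks inside each paragraph gone, the translation editor can segment on sentence boundaries instead of on line ends. The layout is of course still lost, so this is no substitute for a proper converter.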

SDL Trados Studio 2009 and some other CAT tools include a third-party PDF converter. If your CAT tool doesn't support PDF, try ReadIris, Nuance PDF Converter, Solid Converter or Abbyy Transformer, or a free online service like PDFtoWord.com.

3. Manage your customer's expectations

Inform your customer about the challenges of converting PDFs. For instance, it may be possible to extract the text, but the original layout may be (partly) lost, especially when the document consists of multiple columns or text boxes. If the customer expects to receive the translated document with an identical layout, extra work may be needed. Is your customer prepared to pay extra for this?

4. Ask for a sample file before accepting the job

Suppose I had a paper document, took a picture of it, and pasted that picture into an empty Word document. Would that qualify as the Word version of my document?

Your customer's "PDF document" may have been created in a similar way. Imagine a hard-to-read fax, printed on thermal paper, that was scanned as a picture with a flatbed scanner and then saved as PDF. You may even see coffee stains or other dirt on the document. Technically speaking, it's a PDF file, but from a translation automation point of view, it's about as useless as a handwritten document. It will be virtually impossible to extract any text from such PDFs, so you may have to retype the source document before you can even start translating it.
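A quick way to spot such a scanned PDF in a sample file is to look at how much text each page yields. The sketch below assumes you have already extracted the text page by page with whatever tool you use (that step is not shown), and it simply flags pages that contain almost no characters. The function name `pages_without_text` and the threshold of 20 characters are arbitrary choices for this illustration.

```python
def pages_without_text(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return the 1-based numbers of pages whose extracted text is (almost) empty.

    Assumption: page_texts holds the text extracted from each page by some
    other tool. A page that yields fewer than min_chars visible characters
    is probably a scanned image rather than real text.
    """
    suspicious = []
    for number, text in enumerate(page_texts, start=1):
        # Ignore spaces, tabs and line breaks when counting characters.
        visible = "".join(text.split())
        if len(visible) < min_chars:
            suspicious.append(number)
    return suspicious


extracted = [
    "This page has a normal amount of extractable text.",
    "",        # scanned page: nothing could be extracted
    "  \n ",   # scanned page: whitespace only
]
print(pages_without_text(extracted))
```

If most pages end up on that list, the document will need OCR or retyping, and the quote should reflect that.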

5. Test the conversion - again and again

Even the best PDF converter may not succeed in extracting all text properly. Your solution may for instance work fine for English or other Western languages... but can it handle Russian, Korean or Amharic?

I tried converting a mixed English and Slovak PDF with Zamzar, and all characters with Slovak diacritics were corrupted. If you want to know whether it's your converter or the PDF itself that's causing the problem, simply try another conversion solution. PDFtoWord.com converted my Slovak text without problems.
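Corrupted diacritics are easy to miss when you only skim the converted file, so a small automated check can help when comparing converters. This is a rough sketch, not a complete validator: it counts Slovak letters with diacritics, counts the Unicode replacement character (U+FFFD), and lists any other non-ASCII characters. The function name `diacritics_report` is hypothetical, and legitimate characters such as typographic quotes will also show up as "unexpected", so the result needs a human eye.

```python
# Slovak letters that carry diacritics (lowercase and uppercase).
SLOVAK_LETTERS = set("áäčďéíĺľňóôŕšťúýžÁÄČĎÉÍĹĽŇÓÔŔŠŤÚÝŽ")
REPLACEMENT_CHAR = "\ufffd"


def diacritics_report(converted_text: str) -> dict:
    """Summarise the non-ASCII characters found in converted text."""
    non_ascii = [ch for ch in converted_text if ord(ch) > 127]
    unexpected = sorted(
        {ch for ch in non_ascii
         if ch not in SLOVAK_LETTERS and ch != REPLACEMENT_CHAR}
    )
    return {
        # Slovak letters that survived the conversion.
        "slovak_letters": sum(ch in SLOVAK_LETTERS for ch in non_ascii),
        # Characters the converter could not decode at all.
        "replacement_chars": converted_text.count(REPLACEMENT_CHAR),
        # Anything else outside ASCII, often a sign of a wrong encoding.
        "unexpected": unexpected,
    }


# Illustrative input: one intact letter, one replacement character,
# and one letter that was decoded with the wrong encoding.
damaged = "\u010eakujem za va\ufffdu spr\u00c3\u00a1vu"
print(diacritics_report(damaged))
```

Running the same check on the output of two different converters gives a quick indication of which one handled the language better.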

But my next PDF may have been generated from a legacy DTP format on an old Mac, with text saved as pictures, and with strict security settings. You never know what you're gonna get...



Zsolt said...

I agree, processing PDF files can be a nightmare... In my opinion, a PDF file built from scanned pages is not a real PDF file, not even technically, just a set of image files.

In my experience, the best tools for PDF processing are:

PDF to Word: Able2Extract
Word to PDF: PrimoPDF

Forget about using Adobe Professional for converting: it's totally useless and destroys the entire layout. And forget about SDL Trados 2009: it's hilariously expensive, and installing it is a real nightmare. I haven't tried it yet, but I am convinced that it will definitely not handle PDF properly. SDL Trados 2009 is just a mistake, like Vista, filling a time gap between two proper releases.

Gerrit said...

Thanks for pointing out these additional tools, Zsolt.

SDL Trados Studio 2009 uses the Solid Converter technology (see the link in my blog post) to do the PDF conversion. It should work fine, at least if the PDF isn't a mess to begin with...

dmy said...

Thanks for your sound tips, Gerrit. Someone else has told me that SDL Trados 2009 uses Solid Converter technology. What is this assertion based on?
When I got this information, I downloaded a trial version of Solid Converter, and got significantly different results on the same file. (I'm going to publish detailed results soon.) If the technology is fundamentally the same, the results should be identical because the conversion engine will be the same in both products.
Can anyone clear up this issue?

vasu said...

I agree with dmy.
I recently completed a job where the input was a PDF. When I asked for the original, I was told to use Solid Converter. I downloaded it and tried it, but it could not process images containing text. I went back to my favorite, ABBYY FineReader, which did the job fine. I have been using it for ages and feel more comfortable with it. It is even better than IRIS or other extractors and converters, which may be faster and more automated, since it gives more control over the block types. And it also learns?

barbudo said...

Recently I coordinated a project involving a translation of a number of PDF files (created by General Electric). First I imported them into SDL Trados 2009 but the resulting Word files were a mess, so I kept looking for a solution. I found a tool called Infix PDF Editor. The professional version allows the translator to extract the text from the PDF file (provided the PDF is a properly built Acrobat document, not a set of scanned images) and import it as XML documents. We then translated the XML files with memoQ and exported the translated files back to the PDFs using Infix editor. The resulting Acrobat documents needed some (rather simple) editing but it was nothing compared with sorting out the mess left by Trados.