Handling and Translating PDF Files (Part 1)
By Hector Calabia
For many translators, Acrobat Portable Document Format (PDF) files are a nightmare. In forums and on mailing lists, the same question resurfaces periodically: How can I edit this PDF document?
Short answer: you can’t.
Long answer: it is possible, but only to a limited extent, and quite possibly your client will not be happy with the results.
The point is that PDF documents were never meant for editing or translation. About ten years ago, Adobe Inc., its creator, was successful in responding to a market need: documents that could be easily exchanged, printed, and viewed, but not modified, on all kinds of computers. That is, the computer equivalent of a printed document. This is exactly what Acrobat documents are.
They must be considered “printed” documents, not “editable” documents. The format has been so successful that there is a steady flow of PDF documents to translate. However, since most computer formats are editable, there is a problem: despite the original intention of its creators, people cannot be convinced that PDFs are uneditable, and so they ask for translations of and modifications to these documents.
Handling PDF Documents
Many translators already know the answer: you cannot deliver a translation in PDF, at least not in the same PDF that you were given. The format is not (extensively) editable.
What you can do is try to extract the text from the PDF and process it using your favorite word processor. The most straightforward and economical procedure is simply to press the “Select Text” button in Acrobat Reader, press Ctrl-A (Select All), and copy the contents to the clipboard (Ctrl-C). Then you can paste this into your word processor (Ctrl-V).
Depending on the complexity of the page layout, this may prove minimally satisfactory. Although I haven’t used this process for some time, I have just copied and pasted a PDF into Word and the result is usable, at a pinch: fonts and type sizes are kept; tables disappear, although their content is preserved (in a somewhat mangled form); illustrations are gone. The main problem is that each line ends in a hard carriage return/line feed, which generally has to be replaced by a single space in order to have continuous sentences again. I have developed a small Word macro that searches for carriage returns and replaces them with spaces. This, however, has to be done one line at a time, under human supervision, because the system cannot know when the carriage return should be kept (for instance, in headings, in lists, and at the end of paragraphs).
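The line-joining step described above can be partly automated with a heuristic. The following Python sketch is not the Word macro mentioned in the text; it is an illustration under stated assumptions. The function name `rejoin_lines`, the `short_ratio` parameter, and the rule itself are hypothetical: a line break is kept after any line that is noticeably shorter than the widest line (typical of headings and of the last line of a paragraph) and before any line that looks like a list item. All other line breaks are replaced by a single space.

```python
import re

# Assumption: list items start with "-", "*", or a number such as "1." or "1)".
LIST_ITEM = re.compile(r"^(?:[-*]|\d+[.)])\s+")


def rejoin_lines(text, short_ratio=0.7):
    """Replace hard line breaks with spaces, except where a break is
    probably intentional (heuristic only; the result still needs review)."""
    lines = [ln.strip() for ln in text.splitlines()]
    # The widest line approximates the column width of the pasted text.
    width = max((len(ln) for ln in lines), default=0)
    paragraphs = []
    current = ""
    for line in lines:
        if not line:
            # A blank line always ends the current paragraph.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if LIST_ITEM.match(line) and current:
            # A list item starts a new paragraph of its own.
            paragraphs.append(current)
            current = ""
        current = f"{current} {line}" if current else line
        if len(line) < short_ratio * width:
            # A short line suggests a heading or the end of a paragraph.
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return "\n".join(paragraphs)


if __name__ == "__main__":
    pasted = (
        "Handling PDF Documents\n"
        "Many translators already know the answer: you cannot\n"
        "deliver a translation in the same PDF that you were\n"
        "given.\n"
        "- first item\n"
        "- second item"
    )
    print(rejoin_lines(pasted))
```

A paragraph whose last line happens to fill the whole column will be merged with the next one, which is exactly why human supervision remains necessary.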
Automatic Conversion
In many instances, an automatic conversion program is preferable. I have used both a “pure” PDF to Word converter (ScanSoft PDF Converter) and optical character recognition (OCR) software (OmniPage and FineReader). You can find a healthy supply of both types by doing an Internet search on “PDF extraction” or “PDF conversion”.
What is the difference between them?
I have already said that PDFs are like printed documents. In most cases, however, the text is kept as computer characters; that is, you can copy and paste it. In some other cases, all the text (or some of it, in headings, for instance) is just an image, like the characters on a faxed page. “Pure” PDF converters can handle computer characters, but they choke on graphics. If all or part of a document consists of such “graphic character” areas, they cannot process them. In this instance, optical character recognition programs come to the rescue. They look at the page as if it were really a printed page, and they try to interpret it and convert it to computer characters. They may also extract the illustrations on the page. It is not necessary to print and scan the Acrobat pages for this. Modern OCR software accepts them directly.
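The distinction between the two kinds of pages can be checked mechanically once a text extraction has been attempted. The Python sketch below is only an illustration: it assumes the text of each page has already been extracted by some PDF library (for instance pypdf's `PdfReader` and `extract_text()`, which the article does not mention), and the function name `pages_needing_ocr` and the `min_chars` threshold are hypothetical. Pages that yield almost no characters are probably images of text and are candidates for OCR.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return the 1-based numbers of pages whose extracted text is too
    short to be a real text layer (assumed threshold: min_chars)."""
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        # Count only non-whitespace characters; treat None as empty.
        visible = "".join((text or "").split())
        if len(visible) < min_chars:
            flagged.append(number)
    return flagged


if __name__ == "__main__":
    extracted = [
        "Many translators already know the answer to this question.",
        "",         # a scanned page: no text layer at all
        "  \n  ",   # only whitespace came back from extraction
    ]
    print(pages_needing_ocr(extracted))
```

This check will not catch a page where only the headings are images, since the body text alone exceeds the threshold; such pages still have to be spotted by eye.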
PDFs can also be “password protected”. If you do not have the password, you cannot extract text from them. Character converters cannot process these files unless provided with the password. OCR converters can handle them perfectly, as they just “look at the pages” rather than relying on their internal character coding.
(Editor: 吴颖慧)