Copywriter, technical writer, translator (FR>EN, ES>EN, IT>EN), journalist

Pulling text out of PDFs

My favorite way to research is to ask questions of subject matter experts. They take you right to the core of a topic and help you consider angles you didn’t know about.

My second-favorite way to research is to read documents online and pull text from them into my research notes. Many of these documents are PDFs, and I usually just select the text I want and drag and drop that text into my notes.

But sometimes I run into a PDF in which I can’t select text, so I can’t copy and paste from the document – at least not right away. Fortunately, I can fix that problem pretty quickly.

Some (not all) PDFs contain text that programs like Adobe Reader “see” as pictures. You can copy it as a picture, but that defeats the purpose if you want to quote the text – you still need to retype it as you would from a paper copy.
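If you'd rather check a file before opening it, here is a rough Python sketch of one way to guess whether a PDF is picture-only. It rests on an assumption of mine: selectable text needs a font, and fonts are usually declared in the file with the marker /Font. The function names are made up for illustration, and the check is crude, so treat the answer as a hint rather than a verdict.

```python
def looks_image_only(pdf_bytes: bytes) -> bool:
    """Guess whether a PDF holds only pictures of text (no selectable text)."""
    # Assumption: selectable text needs a font, declared with the /Font key.
    # Caveat: newer PDFs can compress those declarations, which hides them
    # from this raw search and can make the guess wrongly come out True.
    return b"/Font" not in pdf_bytes


def file_looks_image_only(path: str) -> bool:
    """Convenience wrapper (hypothetical name): run the check on a file."""
    with open(path, "rb") as handle:
        return looks_image_only(handle.read())
```

Calling file_looks_image_only on a saved PDF returns True when no font marker turns up, which suggests the file may need OCR before you can copy from it.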

That’s where optical character recognition (OCR) comes in. Certain programs, like Adobe Acrobat, can turn the text in non-OCRed PDFs into text that you can copy. (For a detailed explanation of OCR, read this Wikipedia article.)

Note: You may scan a PDF using OCR software and find that the software can’t recognize the text. This may be due to the author having “locked” the PDF before distributing it. If you don’t have the password needed to unlock the PDF, you may want to find the information elsewhere (though tech-savvy people can, of course, find ways to get around password protection).
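A quick way to guess whether a PDF has been locked is to look for the /Encrypt marker, which protected PDFs carry in the file's closing section. The Python sketch below is only a heuristic under that assumption, and the function name is my own. It tells you that protection is probably present, not which restrictions apply or how to remove them.

```python
def is_probably_locked(pdf_bytes: bytes) -> bool:
    """Guess whether a PDF is password-protected or otherwise restricted."""
    # Assumption: a protected PDF names its encryption settings with /Encrypt.
    # A raw byte search can misfire if those bytes happen to appear inside
    # unrelated binary data, so treat the result as a hint only.
    return b"/Encrypt" in pdf_bytes
```

To use it, read the file in binary mode, for example with open(path, "rb"), and pass the bytes to the function.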

Acrobat, while a fine tool (I use it), is also pricey. Less expensive PDF-handling software abounds, so you may want to shop around. A friend pointed me to this review of Google’s free OCR offering. I haven’t tried it, but if you do, let me know what you think.