Dodged the corrupt-document bullet this time, just barely…

A couple of weeks ago, a co-worker sent me a PDF document to look at. He said that he was having trouble copying and pasting from the document and was scratching his head about why this particular PDF would have such issues.

As it turned out, several thousand other documents on a file server shared the same funny behavior. By the time we were done struggling with this problem, I had gained new respect for PDF corruption issues and their prevention.

The Problem

We were looking to load a few thousand of these scientific reports into a fancy-schmancy new database, with linguistic search and other bells and whistles. Much to our chagrin, these documents just weren’t loading, and we couldn’t understand why. They were text documents, with some embedded images, but mostly straightforward text.

Here is an excerpt:

[Image: excerpt of a report as displayed in the PDF viewer]

And you can tell that it is right and proper text because when I blow it up all the way, the fonts are nice and smooth—this isn’t just an image of text.

[Image: a single letter magnified, with smooth edges]

But if I copy and paste that particular paragraph into any handy editor (Notepad, in this case), this is what I see:

[Image: the same paragraph pasted into Notepad, showing gibberish]

And as far as I know, at this point the actual text is beyond the reach of average folks like me. We tried, believe me we tried.
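With thousands of files in the same state, checking each one by hand is not practical. Below is a rough Python sketch of how one might flag garbled extractions automatically. It assumes the text has already been pulled out of the PDF by some other tool, and both the function names and the 0.7 threshold are illustrative guesses, not tuned values.

```python
def readable_ratio(text):
    """Fraction of characters that are letters or whitespace.

    Ordinary prose is mostly letters and spaces; text copied from a
    PDF with a private font encoding tends to be mostly symbols or
    control characters. This is a crude heuristic, not a guarantee.
    """
    if not text:
        # Nothing extracted at all is treated as unreadable.
        return 0.0
    good = sum(1 for ch in text if ch.isalpha() or ch.isspace())
    return good / len(text)


def looks_garbled(text, threshold=0.7):
    """Return True when too few characters look like ordinary prose.

    The threshold is an assumption; number-heavy tables would need a
    lower value or a different test.
    """
    return readable_ratio(text) < threshold


# Example: one readable sentence and one run of control characters.
samples = {
    "good": "The quick brown fox jumps over the lazy dog.",
    "bad": "\x01\x02\x03\x04\x05\x06\x07",
}
for label, sample in samples.items():
    print(label, looks_garbled(sample))
```

A check like this only says that something is wrong. It cannot recover the text, and garbage that happens to land on ordinary letters would slip past it.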

What went wrong?

A quick Google of the subject led us to understand that many PDF generation tools embed subsets of fonts, with nonstandard mappings from the text to the font.

This fellow explains it nicely:

“The PDF file does not contain all the information to extract the text. The problem is that a character in a PDF file may not contain information what “real” character it relates to. Some PDF generators do a pretty bad job when they embed fonts into PDF files. They use a proprietary encoding mechanism (e.g. 1 is A, 2 is B, 3 is C, …) in both the embedded font and when they place glyphs on the page. Without a table that implements the reverse (e.g. character code 1 is ‘A’) you cannot extract text from such a file.

There is nothing you can do (besides to complain to whoever created the PDF file, and the author of the software that created this file).”
— from khkremer on experts-exchange.com
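To make that concrete, here is a toy model in Python. It is only a sketch: the function names are made up for illustration, and a real PDF keeps this information in its font structures rather than in Python dictionaries. The pretend generator numbers each character in order of first appearance (1, 2, 3, …). Without the reverse table (in a real PDF, a font's ToUnicode map plays this role), the codes come back out as garbage.

```python
def build_subset_encoding(text):
    """Assign codes 1, 2, 3, ... to characters in order of first use,
    the way a careless generator might number glyphs in a subset font."""
    char_to_code = {}
    for ch in text:
        if ch not in char_to_code:
            char_to_code[ch] = len(char_to_code) + 1
    return char_to_code


def encode(text, char_to_code):
    """Turn text into the list of private codes placed on the page."""
    return [char_to_code[ch] for ch in text]


def naive_extract(codes):
    """Copy and paste without a reverse table: each private code is
    misread as if it were a standard character code."""
    return "".join(chr(code) for code in codes)


def extract_with_table(codes, code_to_char):
    """Extraction when the reverse table is available."""
    return "".join(code_to_char[code] for code in codes)


original = "scientific report"
table = build_subset_encoding(original)
codes = encode(original, table)

# The reverse table is exactly what the garbled files were missing.
reverse = {code: ch for ch, code in table.items()}

print(repr(naive_extract(codes)))
print(repr(extract_with_table(codes, reverse)))
```

The viewer can still draw the page correctly because it only needs to know which glyph to paint for each code, not which character that glyph stands for.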

As it turned out, many of the reports had been generated by printing from Microsoft Word to Adobe Acrobat Distiller. It would seem that the default settings used for Distiller included the “totally hose my document content” switch.

The Solution

We fretted over this quite a bit. These are important scientific reports, and there is no easy way to ungarble them. We finally ended up contacting the ABBYY FineReader folks and trying out their OCR toolkit for Linux. Not only did this product make fast work of running optical character recognition on the sample document, but once we had a script running, we blew through the 10,000 pages the trial license gave us in a day or two.
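For anyone scripting a similar batch against a page-limited license, a sketch like the following might help plan which files fit. It is not the script we ran: the function name, the file names, and the idea of knowing page counts up front are all assumptions for illustration, and the actual OCR call is left out entirely.

```python
def plan_ocr_batch(page_counts, page_budget=10_000):
    """Pick documents to OCR, in order, without exceeding a page budget.

    page_counts: list of (filename, pages) pairs.
    Returns (selected filenames, pages used, skipped filenames).
    A file that does not fit is skipped, and later, smaller files
    may still be selected.
    """
    selected, skipped, used = [], [], 0
    for name, pages in page_counts:
        if used + pages <= page_budget:
            selected.append(name)
            used += pages
        else:
            skipped.append(name)
    return selected, used, skipped


# Hypothetical inventory of reports and their page counts.
inventory = [("a.pdf", 6000), ("b.pdf", 5000), ("c.pdf", 3000)]
selected, used, skipped = plan_ocr_batch(inventory)
print(selected, used, skipped)
```

Knowing in advance which files will not fit makes it easier to decide whether a trial is enough or a full license is needed.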

Imperfect, at best

I am happy that we were able to salvage the bulk of the electronic knowledge found within those thousands of files, but our work barely scratched the surface.

For example, most of these documents have rich bookmarking of sections and keywording, such as this (content tastefully blurred on purpose).

[Image: a report with its bookmarked table of contents, content blurred]

In addition, scientific documents typically have loads of tables full of numbers. Though it is possible to mine this data with a good OCR tool (the FineReader API provides tools for just this purpose), the tables are far more difficult to extract correctly once the original text information is lost.

Final thoughts

I wrote a few weeks ago about document formats, mentioning the PDF/A document standard. This is worth investigating, regardless of what your document needs are.

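If our thousands of files had been generated as PDF/A at the stricter Level A conformance (PDF/A-1a), we would very likely have been able to copy and paste from them without problems: that level requires fonts to carry a mapping from character codes to Unicode, which rules out the font shenanigans perpetrated on our garbled reports. The looser Level B (PDF/A-1b) only guarantees that the pages look right, so it would not have been enough by itself.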

In the end, our OCR sledgehammer approach worked like a charm, and is probably sufficient for our needs. Text mining is a pretty slushy business, so no one will complain if there are a few typos on each page. If they find the doc in a search, they can print it and read it the old-fashioned way.
