Keeping your secrets to yourself—old changes lingering in your PDF files

A few months ago I wrote an article that touched upon the problems inherent in attempts to sanitize documents before sending them to the enemy, perhaps to remove competitors’ names or trade secrets.

I was reading a post on a board I frequent where a person was describing exactly this kind of activity—removing sensitive information from PDF documents. Several suggestions were made, but one individual suggested opening the file in Acrobat Pro and replacing the sensitive text with good old Lorem Ipsum.

It was at that moment that I recalled a peculiar feature of the PDF file format: it is designed to support nondestructive updates, allowing people to make vast changes to a PDF document while still retaining the original document, fully intact. I did a few experiments and was surprised by the results.

A Brief Note on the PDF File Format

For the geeky types among us, one place to begin is this article:

Portable Document Format: An Introduction for Programmers

The key point to take from the article: a PDF document consists of four distinct sections (a Header, a Body, an “xref” Table, and a Trailer), and at the very end of the file you will find the character sequence %%EOF.

The PDF standard was designed to allow multiple updates to a document, while retaining the original version. This is accomplished by appending anything new to the end of the document, after the original EOF tag. The document will now have two EOF tags: one indicating where the original document ended, and a new EOF tag indicating where the new changes end.
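If you would rather not eyeball a hex dump, a few lines of Python can find the EOF tags for you. This is only a minimal sketch: the function name eof_offsets and the sample bytes are made up for illustration, and the sample is a stand-in rather than a real PDF.

```python
from typing import List

EOF_MARKER = b"%%EOF"


def eof_offsets(data: bytes) -> List[int]:
    """Return the byte offset of every %%EOF marker in the file contents."""
    offsets = []
    pos = data.find(EOF_MARKER)
    while pos != -1:
        offsets.append(pos)
        # Resume the search just past the marker we found.
        pos = data.find(EOF_MARKER, pos + len(EOF_MARKER))
    return offsets


# Hypothetical stand-in for a file that was saved once and then edited once.
sample = b"%PDF-1.4\n...original objects...\n%%EOF\n...appended changes...\n%%EOF\n"

offsets = eof_offsets(sample)
print(len(offsets), "EOF tags found at offsets", offsets)
```

For a real file you would pass in the bytes read from disk in binary mode. More than one offset suggests the file carries at least one appended update.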

If we wish to revert PDF changes, it should be a simple matter of opening the PDF file in a binary editor, searching for the first EOF tag, and deleting everything following.
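The same idea can be sketched in Python instead of a binary editor. Treat this as a rough illustration under a big assumption: that the first %%EOF in the file really does mark the end of the original revision, as in the simple case described here. The function name strip_appended_updates and the sample bytes are my own inventions.

```python
EOF_MARKER = b"%%EOF"


def strip_appended_updates(data: bytes) -> bytes:
    """Return the bytes up to and including the first %%EOF marker."""
    pos = data.find(EOF_MARKER)
    if pos == -1:
        raise ValueError("no %%EOF marker found; is this a PDF?")
    end = pos + len(EOF_MARKER)
    # Keep the line ending that follows the marker, if there is one.
    if data[end:end + 2] == b"\r\n":
        end += 2
    elif data[end:end + 1] in (b"\r", b"\n"):
        end += 1
    return data[:end]


# Hypothetical stand-in for an edited file: original content, then an update.
edited = b"%PDF-1.4\n...secret text...\n%%EOF\n...lorem ipsum...\n%%EOF\n"
reverted = strip_appended_updates(edited)
print(reverted)
```

Work on a copy: read the edited file in binary mode, pass the bytes through the function, and write the result to a new file rather than overwriting the one you started with.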

A Simple Experiment

Let’s start with a proper secret document containing missile plans…

[Image: the original missile plans document]

Suppose we want to obscure some special information in paragraph 37. We can open the file in Acrobat Professional and use its text editing features to swap in the venerable Lorem Ipsum text.

Here’s what it looks like after the switch:

[Image: the document after the Lorem Ipsum replacement]

You can see here that the first seven lines of text starting at paragraph 37 have been replaced with suitably unreadable text.

Now, open the new PDF file in a binary editor (PDF files contain a mix of text and binary data, so an ordinary text editor won’t do).

[Image: the edited PDF open in a binary editor]

Note the %%EOF character sequence embedded in the text. This is the first EOF tag, indicating where the original file ended. All we need to do is place the cursor to the right of the EOF and delete everything to the end of the file.

Once we have done so, it’s like magic:

[Image: the document after the binary edit]

The edits that replaced lines of paragraph 37 with gibberish have neatly been undone!

More Details

From the PDF Intro document linked earlier:

“The trailer, it turns out, plays an important role in the way PDF implements incremental updating. The key concept to understand here is that a PDF file is never overwritten, only added to. That goes for all portions of the PDF file – even the trailer itself, and the end-of-file marker. In other words, a multiply-updated PDF document may contain multiple trailers – and multiple end-of-file markers! (There may be numerous occurrences of %%EOF.) Each time the file is edited, an addendum is written to the tail of the file, consisting of the content objects that have changed, a new xref section, and a new trailer containing all the information that was in the previous trailer, as well as a /Prev key specifying the byte offset (from the beginning of the file) of the previous xref section. The cross-reference info will then be distributed across more than one xref section. To access all of the cross-references, the reader must walk the list of /Prev keys in all the trailers, in reverse order.

Space doesn’t permit a detailed exploration of updates here, but you can find several examples in Appendix A of the PDF 1.3 specification (available at http://partners.adobe.com/asn/developer).”
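To make the walk along the /Prev keys a bit more concrete, here is a minimal Python sketch. It rests on several assumptions that are mine, not the quoted article’s: the file uses classic xref tables with a plain “trailer” keyword, the trailer dictionary contains no nested dictionaries, and the name xref_offsets and the tiny two-revision sample are invented for illustration.

```python
import re
from typing import List


def xref_offsets(data: bytes) -> List[int]:
    """Follow startxref and the /Prev keys, newest xref section first."""
    # The last "startxref" in the file points at the newest xref section.
    start = data.rfind(b"startxref")
    if start == -1:
        raise ValueError("no startxref keyword found")
    match = re.match(rb"startxref\s+(\d+)", data[start:])
    if match is None:
        raise ValueError("startxref is not followed by an offset")
    offsets: List[int] = []
    offset = int(match.group(1))
    while offset not in offsets:  # guard against a looping /Prev chain
        offsets.append(offset)
        # The trailer dictionary follows the xref table starting at `offset`.
        trailer = data.find(b"trailer", offset)
        if trailer == -1:
            break
        end = data.find(b">>", trailer)
        if end == -1:
            end = len(data)
        prev = re.search(rb"/Prev\s+(\d+)", data[trailer:end])
        if prev is None:
            break  # the original revision has no /Prev key
        offset = int(prev.group(1))
    return offsets


# Hypothetical two-revision file, built so that the offsets are correct.
body = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n"
first_xref = len(body)
first = (body + b"xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 2 >>\n"
         + b"startxref\n" + str(first_xref).encode() + b"\n%%EOF\n")
update = first + b"1 0 obj\n<< /Edited true >>\nendobj\n"
second_xref = len(update)
second = (update + b"xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 2 /Prev "
          + str(first_xref).encode() + b" >>\n"
          + b"startxref\n" + str(second_xref).encode() + b"\n%%EOF\n")

print(xref_offsets(second))
```

Each offset in the returned list corresponds to one xref section, so a list with more than one entry means the file has been updated in place at least once.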

Summary

It is important to understand that the PDF standard allows for appended updates to files that leave the original document intact, regardless of how drastic the changes are. If you are intent on redacting text from PDF documents, do not depend on simply deleting the secrets using a PDF editor—you must use a proper redaction tool that addresses these issues correctly.

That said, I did some experimenting with a few utilities (Apple Preview, PDFpen, and Adobe Acrobat Pro) and found that some write the file from scratch each time, with no lingering cruft from former versions, while others respect the original intent of the PDF standard. This means that you can’t trust that older revisions are being retained in your file and you can’t trust that they aren’t.

Be conservative: use a redaction tool for secrecy and proper backups for versioning.

5 Responses to “Keeping your secrets to yourself—old changes lingering in your PDF files”

  1. John McLaughlin writes:

    Hey Tad,

    Just stumbled across your blog and I’m really enjoying it — I’m currently on a quest to de-paper (scansnap just arrived) so it’s great to learn from your experience. Keep up the good work!

    -John

  2. Tad writes:

    Hi John,

    Glad to hear you’re ScanSnapping away with intent… Drop me a line as you come up with your own workflows and work out the kinks in the process.

    Good luck!

  3. John McLaughlin writes:

    Will do. Right now my workflow is to scan into a BAF (big ass folder!) with reasonable names. I’m playing with Leap for doc management and keeping it all in a dropbox so it’s backed up online. I’m still sort of working out the kinks though….

  4. What’s Inside That PDF Document? | DocumentSnap writes:

    [...] workings are like? OK probably not, but Tad over at Paper Jammed has, and he has put together a delightfully geeky rundown of some of the internal workings of the PDF document that allow for multiple [...]

  5. prasanth writes:

    The best way I found to ensure this is to ‘print to pdf’ anytime I am sending out a pdf or doc file for that matter. It’s kind of scary what information could be left behind in the original file.
