Keeping your secrets to yourself—what can your shared documents tell others?

Do you ever send documents to other people that might have … sensitive information embedded in them?

Not everyone who works with documents at home will run into this problem, but sooner or later you will probably want to email someone a useful document that just happens to have your Social Security number embedded in it, or your full name and address, or some other information that you would rather keep private.

This process of editing documents to remove sensitive content is called redaction. That’s the keyword you probably want to search for as you tiptoe through Google for guidance.

In this article I discuss the obvious problems we face using the most naïve approach toward document redaction, and provide some resources for better options.

The only sure way

The only sure way to guarantee that secret information is gone is to print the document, physically cut out the bad bits, scan the result, and send the scanned PDF to your colleague. This may seem a bit extreme, but if you were an anonymous tipster sending the media a document full of mob-related evidence that contained your name, you might go this route. (You probably don’t want to send the email from your personal account, either. Try a throwaway email account at the library.)

Other options… Microsoft Word

Don’t even think about sending a raw MS Word document to your recipient. Those documents carry loads of hidden data, such as tracked changes, comments, and author properties, that you might forget about. If you really must send one, look into Microsoft’s recommendations, and consider tools such as Microsoft’s free Office add-in for removing hidden data.

Danger lurking in PDF documents

Since my paperless life revolves around PDF documents, this is the kind of document I am most likely to send via email. Unfortunately, PDF documents can carry even more hidden data than MS Office documents. Many people have been burned by simple attempts to obscure parts of a PDF.

A Simple Demonstration

I started with a nice PDF of the Declaration of Independence.

Now, supposing that we needed to send this document to a colleague, but we must not reveal the name of the original signer, we might try opening up the PDF in our favorite PDF markup tool and slapping a big fat rectangle over the sensitive information.

At this point everything looks fine. But the enemy is crafty and exploits the huge flaw in our thinking: the information never left the document. All they need to do is copy and paste.

A quick copy/paste from the PDF viewer application to Microsoft Word lets the whole world see that John Hancock is to blame! Better let him know we slipped up so he can take appropriate actions.
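To see why the rectangle does not help, here is a minimal sketch using a hand-written, uncompressed PDF content stream. The stream, the coordinates, and the function names are illustrative assumptions, not the output of any particular markup tool. The sketch assumes that covering text only appends drawing operators (`re` to define a rectangle, `f` to fill it) and leaves the `Tj` text-showing operator untouched.

```python
import re

# Toy page content: one line of text, written with standard PDF operators.
PAGE_TEXT = b"BT /F1 12 Tf 72 700 Td (Signed by: John Hancock) Tj ET\n"

# What a markup tool might add: set fill to black, draw a rectangle, fill it.
BLACK_BOX = b"0 0 0 rg 130 695 90 16 re f\n"


def cover_with_rectangle(stream: bytes) -> bytes:
    """Mimic the naive approach: paint over the text, change nothing else."""
    return stream + BLACK_BOX


def extract_shown_text(stream: bytes) -> list[str]:
    """Collect the literal strings passed to the Tj (show text) operator."""
    return [m.decode("latin-1") for m in re.findall(rb"\((.*?)\)\s*Tj", stream)]


covered = cover_with_rectangle(PAGE_TEXT)
print(extract_shown_text(covered))
```

The name is still returned after the box is drawn, which is the same thing the copy/paste trick exploits. Real PDFs usually compress their streams and may encode text differently, so this regex is only a demonstration of the principle.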

This sounds trivial, right?

In February, the Associated Press was able to uncover the secret details of the Facebook/ConnectU settlement using this same technique.

Apparently, the U.S. military has been caught in the same trap.

Last year, Google founder Larry Page’s home address info was leaked in a similar fashion.

How about Scanned Documents?

Up to this point I was working with a document that had been printed to PDF, thereby preserving the document text perfectly.

What about a document that we scan in?

Here’s some honest-to-goodness missile plans…

This is an excerpt from a scanned copy of the U.S. patent for the venerable Sidewinder Missile, complete with a black square that I have added to obscure some special information.

As seen here, the copy/paste trick still worked.

But why does it still work? Because the document had OCR run on it at some point, and the recognized text is still stored as a hidden layer behind the page image.

A brief look at Acrobat’s document inspector tool shows the hidden secrets:

All of the red text above is hidden text. The actual hidden text is displayed by itself in the box on the right side of the screen above. It isn’t very pretty, but it has all of the details.
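Hidden OCR text of this kind is commonly stored with the PDF text rendering mode set to 3 (the `3 Tr` operator), which means the text is neither filled nor stroked, so it is invisible but still selectable. The sketch below looks for such text in a toy, uncompressed content stream. The stream contents and the function name are assumptions made for illustration, and this is not how Acrobat’s inspector is implemented.

```python
import re

# Either a text rendering mode change ("<n> Tr") or a shown string ("(...) Tj").
TOKEN = re.compile(rb"(\d)\s+Tr|\((.*?)\)\s*Tj")

# Toy scanned page: the page bitmap, then an invisible OCR text layer.
SCANNED_PAGE = (
    b"q 612 0 0 792 0 0 cm /Im0 Do Q\n"
    b"BT 3 Tr /F1 10 Tf 72 700 Td (placeholder OCR words) Tj ET\n"
)


def invisible_text(stream: bytes) -> list[str]:
    """Return strings shown while the text rendering mode is 3 (invisible).

    Simplification: ignores q/Q graphics-state save and restore.
    """
    mode = 0  # the default rendering mode is 0: filled, visible text
    hidden = []
    for m in TOKEN.finditer(stream):
        if m.group(1) is not None:
            mode = int(m.group(1))
        elif mode == 3:
            hidden.append(m.group(2).decode("latin-1"))
    return hidden


print(invisible_text(SCANNED_PAGE))
```

A black square drawn over the bitmap does nothing to this layer, which is why the copy/paste trick still works on the scanned patent.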

Proper Redaction

If you are concerned about keeping your secrets secret, do a bit of research into the tools available. You want to be absolutely certain that you don’t pass along any more information than you intend to.

Adobe Acrobat Professional comes with tools to do just this, and I show their use here:


You can see that I have used a redaction tool to select scanned text. Acrobat is selecting the hidden text as well as the bitmap image of the page. Once I apply the redaction, you can see the result below:

Now when my enemy tries the old copy/paste trick, the text between “38” and “said means” comes out totally blank, as intended.
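The difference between covering and redacting can be sketched on a toy, uncompressed content stream: a real redaction removes the secret from the text operands as well as painting the box. This is only an illustration under simplifying assumptions (one literal string, no compression, no font-specific encoding), with hypothetical function names. It is not Acrobat’s implementation.

```python
import re

# Toy page content and the black box a tool would paint over the name.
PAGE_TEXT = b"BT /F1 12 Tf 72 700 Td (Signed by: John Hancock) Tj ET\n"
BLACK_BOX = b"0 0 0 rg 130 695 90 16 re f\n"


def redact(stream: bytes, secret: bytes) -> bytes:
    """Delete the secret from the stream itself, then paint the box."""
    return stream.replace(secret, b"") + BLACK_BOX


def extract_shown_text(stream: bytes) -> list[str]:
    """Collect the literal strings passed to the Tj (show text) operator."""
    return [m.decode("latin-1") for m in re.findall(rb"\((.*?)\)\s*Tj", stream)]


redacted = redact(PAGE_TEXT, b"John Hancock")
print(extract_shown_text(redacted))
```

After this step the name is absent from the stream, so there is nothing left for a copy/paste to recover. A real tool also has to deal with images, hidden layers, and metadata.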

Summary

I covered a very simplistic form of redaction here as well as a very simple way of getting around someone’s naïve censoring. Don’t stop here. You should use your PDF editor to search the metadata and hidden text for any terms you don’t want made public. You may wish to strip all metadata from your documents.
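If you want a crude second check beyond your PDF editor’s search, the sketch below scans a file’s raw bytes, plus any Flate-compressed streams it can inflate, for terms you want kept private. It uses only the Python standard library. The sample bytes, the author name “Jane Doe”, and the function name are made up for illustration. It will miss text stored in hex strings, custom font encodings, or encrypted files, so a clean result is not proof that a document is safe.

```python
import re
import zlib

# Matches the body of each stream object in the file.
STREAM = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def find_terms(pdf_bytes: bytes, terms: list[str]) -> list[str]:
    """Return the terms present in the raw file or in any inflatable stream."""
    haystacks = [pdf_bytes]
    for match in STREAM.finditer(pdf_bytes):
        try:
            haystacks.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            pass  # not Flate data (for example, an image in another format)
    lowered = [h.lower() for h in haystacks]
    return [
        t for t in terms
        if any(t.lower().encode("latin-1") in h for h in lowered)
    ]


# A tiny stand-in for a PDF: plain metadata plus one compressed text layer.
hidden_layer = zlib.compress(b"BT 3 Tr (John Hancock) Tj ET")
sample = (
    b"1 0 obj\n<< /Author (Jane Doe) >>\nendobj\n"
    b"2 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + hidden_layer
    + b"\nendstream\nendobj\n"
)
print(find_terms(sample, ["John Hancock", "Jane Doe", "Thomas Jefferson"]))
```

To check a real file, read it with `open(path, "rb").read()` and pass the bytes to `find_terms` along with your own list of names, numbers, or addresses.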

This topic has been covered in depth by many, particularly in the legal field. Here are a few articles worth reading:

Control metadata in your legal documents (Microsoft)

Redaction and Metadata Removal eSeminar (Acrobat for Legal Professionals)

Redacting PDF files with Acrobat 8 (AcrobatUsers.com)

2 Responses to “Keeping your secrets to yourself—what can your shared documents tell others?”

  1. Arvind Ganesan writes:

    Hi Tad,

    Nice article on the topic of redaction and how many people get it wrong!

    We (at Extract Systems) specialize in fully automated redaction as well as semi-automated redaction (where a percentage of documents are selected for manual verification) using our ID Shield software. We also have a desktop version of our product called ID Shield Office which can be used on a case by case basis to produce redacted documents. Our output documents (even when they are in PDF format) are 100% free of any metadata and hidden content because we flatten out an intelligent document (like a PDF) into an image (i.e. raw pixels), process the image, and produce the redacted image (or optionally a redacted PDF from the redacted image). Let me know if you’d like to evaluate our software for free – I would be happy to arrange for a free evaluation copy for you.

    Best regards,
    Arvind Ganesan
    CTO, Extract Systems.

  2. Roy Brookes writes:

    The redaction mistakes you describe are surprisingly common. Many highly sensitive documents have been released using the ‘rectangle cover method’ you have described, and as a result people have been able to extract the supposedly ‘redacted’ information.

    The RapidRedact product range (by Onstream Systems) has been specifically developed to overcome these common redaction mistakes and provides many solutions specifically developed for what users/organizations require in a redaction product.

    Our products provide a plethora of redaction tools which allow the user to customize how they want to redact their important documents. A limitless amount of information such as Social Security Numbers, Names, Credit Card Numbers, etc can be automatically redacted and all hidden meta-data belonging to the document is completely removed.

    RapidRedact software is capable of opening and redacting any document type, including PDF and all office documents (.doc, .xls, .msg, etc.), and we aim to give the user complete control over the redaction process by providing many advanced redaction tools whilst keeping the process easy and intuitive.

    Comprehensive RapidRedact product Information can be viewed at http://www.rapidredact.com and a free trial of our Desktop product is available for download.

    Regards,

    Roy Brookes
    RapidRedact
