Carefully inspect all scanned documents
Monday, 9 February 2009
Today I sat down to scan in a few lengthy documents, on the order of hundreds of pages. I carefully riffled them to avoid stuck sheets, dropped them in the scanner, and pressed the button.
After the scanning was complete, I combined the resulting files into one and prepared to run OCR on the whole lot.
Before starting that long-running process, however, I quickly paged through the final document looking for glaring errors.
As it turned out, my scanner software had been fooled by some pages that had large technical drawings on them, and the software had automatically flipped a handful of pages. Good thing I checked before trying to OCR upside-down text!
This is what I look for when I do my quick inspection of my work…
Missing Pages
If your source document has page numbers, this is fairly easy to check. I simply count off, hitting the page-down key ten times, then verify that the page number has advanced by exactly ten. I can go through a three-hundred-page document in a minute or two using this technique, and I can be certain that I got all of the pages.
Why would there be any pages missing? Usually, it is because a couple of pages stuck together and went through the scanner as one sheet.
You can help prevent this problem by carefully riffling all of the pages, top and bottom, prior to scanning.
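If you happen to have the printed page numbers in machine-readable form, the count-by-tens check can be sketched in a few lines of Python. This is only an illustration of the idea, not something my scanner software does: the function name `find_page_gaps` and the input list are hypothetical, and the sketch assumes every scanned page carries one printed number and that the source is numbered consecutively.

```python
def find_page_gaps(printed, step=10):
    """Spot-check every `step`-th scanned page for skipped page numbers.

    printed[i] is the printed page number seen on the i-th scanned page
    (hypothetical input; you would have to collect these yourself).
    Returns (number_at_start, number_at_end, pages_missing) for each
    window where the printed number advanced more than the page count.
    """
    if not printed:
        return []
    last = len(printed) - 1
    # Look at every `step`-th page, plus the final page.
    checkpoints = list(range(0, last, step)) + [last]
    gaps = []
    for a, b in zip(checkpoints, checkpoints[1:]):
        expected = b - a                    # pages paged through
        actual = printed[b] - printed[a]    # change in printed number
        if actual != expected:
            gaps.append((printed[a], printed[b], actual - expected))
    return gaps


# A 30-page document where page 14 stuck to page 13 and was never scanned.
scanned = [n for n in range(1, 31) if n != 14]
print(find_page_gaps(scanned))  # [(11, 22, 1)]
```

The result only narrows the problem down to a ten-page window, which is exactly what the manual technique does; you still page through that window to find the missing sheet.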
Incorrect Orientation
Sometimes scanner software (like mine) thinks it’s pretty smart and automatically interprets the orientation of the scanned pages. This is usually a good thing, since many items we scan are printed in “landscape” fashion, requiring a 90-degree rotation to be correct. However, sometimes the scanner will make a mistake in a large document, resulting in one or more flipped pages.
You can either turn off the auto-rotation feature of your software, or you can fix any flipped pages by hand. I choose the latter, mostly because I’m certain I will forget to turn the auto-rotation back on.
Document Feed Problems
You have seen it before: you flip through the pages you recently scanned in, only to find one that is at some totally cockeyed angle. There’s nothing you can do about this once it has happened, other than to rescan the sheet and insert it in the document in the right place.
You can try to prevent feed problems by verifying that the pages aren’t stuck together at all, by a folded corner perhaps, and by verifying that the pages don’t have any dog ears or tears that might catch on the mechanism as they go through.
If I have a document with a tear in it that won’t feed, I will try one of the following: scan it upside-down; scan it in the plastic sheet feeder that came with my scanner; use a bit of transparent tape to cover the tear; or scan it in a flatbed scanner.
Illegible Scans
Sometimes you will realize that something made your scan less than usable—perhaps a moiré pattern appeared as a result of the scan process, or perhaps your resolution just wasn’t good enough to capture the fine detail.
In this case, you will need to make adjustments to your scanning software and rescan.
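As a rough guide to whether a resolution setting is high enough, you can work out how many pixels tall a line of small print will be. The sketch below is plain arithmetic (one point is 1/72 of an inch); the function name `pixels_tall` is made up for illustration, and the point size is the full height of the type body, so the letters themselves come out somewhat smaller.

```python
def pixels_tall(point_size, dpi):
    """Approximate height in pixels of type of a given point size.

    One typographic point is 1/72 inch, so the height in inches is
    point_size / 72, and the scanner samples `dpi` pixels per inch.
    """
    return point_size * dpi / 72


# 6-point fine print at two common scanner settings.
print(pixels_tall(6, 150))  # 12.5
print(pixels_tall(6, 300))  # 25.0
```

Doubling the resolution doubles the pixel height of every character, which is why fine detail that turns to mush at a low setting can become readable on a rescan.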
Once you have discarded the original pages, you can never reconstruct information that was lost. Always inspect the finished product.



No. 1 — December 23rd, 2009 at 5:57 am
Hi there! Found your site whilst googling an issue I am having. I read through this article and definitely picked up some great tips. However, I was wondering if you ever came across this particular issue:
Using a Xerox Documate 632 scanner with feed or glass scanning, I scan a series (or just one page) of horizontally aligned (i.e. landscape) pages. These are basically Excel tables filled with text – addresses, names etc… The pages are read horizontally i.e. left to right.
Now the scanner does OCR at 300dpi. However, the OCR is reading the page from top to bottom. In other words, when I try to highlight text in the document it highlights in a downwards fashion, whereas the text should be highlighting as it reads – left to right.
Have you seen anything like this before? If so, do you have any recommendations to ensure the scan does OCR in the correct direction?
Any advice would be greatly appreciated.
Thanks!
No. 2 — December 31st, 2009 at 8:21 pm
Hi Sean,
The direction of copy/paste highlighting is all about how the PDF document is generated, and this is going to be a direct result of the OCR process.
The scanner typically generates TIFF or PDF with raw images (in your case 300dpi). The OCR software then processes those PDF documents and generates new documents that have the text layered on top of the image.
This way, you still see the original image, while the text can be selected and copied.
Something in this process is not set correctly; the scanner and the OCR software are somehow out of sync. If I understand correctly, the invisible OCR text has been laid out incorrectly over the visible image, or possibly the text flow has been configured in a columnar fashion.
This is definitely an OCR software configuration issue.
Your scanner seems to be shipped with OmniPage Pro as well as Visioneer OneTouch with Kofax VRS. I’m not sure if the latter duo provides OCR, but OmniPage Pro definitely does so. Which of these products performs the OCR in your workflow?
It might be possible that there is a setting that configures how spreadsheets are treated, columnar or row flow. Perhaps it is as simple as telling OmniPage Pro to work in Spreadsheet mode (an option I see in the online docs).
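To make the difference between row flow and columnar flow concrete, here is a small Python sketch. It is not how OmniPage Pro or any other OCR product works internally; the `Word` type, the coordinates, and both function names are hypothetical, and the sketch assumes the words sit on a perfectly aligned grid. It only shows that the same recognized words can be strung together in two different orders, and that the order chosen determines how highlighting runs.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word on the page
    y: float  # top edge; the origin is the top-left corner, y grows downward


def row_flow(words):
    """Order words line by line: top to bottom, then left to right."""
    return [w.text for w in sorted(words, key=lambda w: (w.y, w.x))]


def column_flow(words):
    """Order words column by column: left to right, then top to bottom."""
    return [w.text for w in sorted(words, key=lambda w: (w.x, w.y))]


# A tiny two-column table, as the OCR engine might see it.
table = [
    Word("Name", 0, 0), Word("City", 100, 0),
    Word("Ann", 0, 20), Word("Oslo", 100, 20),
]
print(row_flow(table))     # ['Name', 'City', 'Ann', 'Oslo']
print(column_flow(table))  # ['Name', 'Ann', 'City', 'Oslo']
```

If the text layer is written in the second order, dragging to select text will run down each column before moving to the next one, which matches the downward highlighting you describe.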