Automate ScanSnap OCR process on your Mac with AppleScript
Saturday, 29 August 2009
Some months back I wrote an article on using scripting languages to glue workflows together. My inspiration for that article was a bit of AppleScript that I had labored over to smooth out a minor annoyance in my scan-to-OCR workflow.
I had promised that once I cleaned up the embarrassing bits of code I would post a perfect polished version here, but such promises are rarely fulfilled. A reader posted a comment asking for that source code, so I will post it here in its current state. The truth is, I have been using this script for months and, though it has some quirks, it works fine.
So this post is about Macintosh, AppleScript, and the ScanSnap-to-FineReader workflow. If these don’t interest you, better move on.
Update: The script on this page works only with Leopard (10.5). Get the Snow Leopard version here
The Original Problem
The Fujitsu ScanSnap S510m, my workhorse scanner, was designed to scan documents quickly and generate PDF files, and this it does flawlessly. To provide OCR support, Fujitsu ships a special version of FineReader, called FineReader for ScanSnap. In the standard OCR configuration, the output of the scanner is chained directly to the FineReader program.
The problem is that this forces scanning and OCR to run in lockstep: you scan a document, you wait for OCR, and then you scan another document.
My desire was to write a simple AppleScript that would detach the “Scan a Document” process from the “OCR” process. By using this script, I can scan documents at whatever rate pleases me, and the OCR engine will chug along at its own pace, consuming my scanned documents and producing OCR documents.
My Approach
I really looked hard at the OCR application, trying to find AppleScript hooks or special command line switches that might allow me to control it better. Sadly, it was not designed to be scriptable. The only thing I could do was call the FineReader application with a source file.
Given this limitation, I considered writing a script that would look at a particular folder, identifying new files as they appear and passing them on to FineReader.
Fortunately, AppleScript provides this kind of functionality with little effort in the form of Folder Actions. Perhaps the best way to see these in action (and try it out) is to see this post on Exploring the power of Folder Actions.
In order to achieve my goals, I did the following:
- Created a folder called “Pending Documents”
- Wrote the script to find the oldest unprocessed file and call FineReader with it
- Attached the script to the folder as a Folder Action
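The "find the next unprocessed file" step can also be sketched outside AppleScript. What follows is a minimal shell sketch under stated assumptions, not the script itself: next_unprocessed and SUFFIX are hypothetical names, files are visited in glob order rather than strictly oldest-first, and scanner output is recognized only by file name (the AppleScript further down also checks Spotlight metadata).

```shell
#!/bin/sh
# Hypothetical helper: print the first PDF in a folder that has no OCR
# companion yet. SUFFIX is the ending FineReader for ScanSnap appends.
SUFFIX=" processed by FineReader.pdf"

next_unprocessed() {
    dir=$1
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue                # glob matched nothing
        case $f in
            *"$SUFFIX") continue ;;            # this is itself an OCR output
        esac
        [ -e "${f%.pdf}$SUFFIX" ] && continue  # companion exists: already done
        printf '%s\n' "$f"
        return 0
    done
    return 1                                   # nothing left in the queue
}
```

Calling next_unprocessed "$HOME/Pending Documents" prints one path when work remains and returns a non-zero status when the queue is empty, which is the same contract as the script's getNextFile returning "".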
The Script
Let’s jump right into the AppleScript. Download the script here.
(*
  This is a folder listener script that will act as a queue,
  receiving PDF files from the ScanSnap scanner and feeding them,
  one by one, to the Abbyy FineReader OCR software.

  This allows you to keep scanning while the OCR job runs in the
  background on all of the unprocessed files.

  Why do we want to do this? The ScanSnap Manager software does not
  support this by default, so when you scan in a file, it sends it
  to FineReader for OCR. You then must wait until FineReader
  finishes its work before scanning in another document. This
  script allows you to keep scanning without waiting for OCR.

  Installation:
    o Copy this script to:
        <home>/Library/Scripts/Folder Action Scripts
      You may have to create the "Folder Action Scripts" folder.
    o Now open a Finder window, control-click and choose:
        More / Configure Folder Actions...
    o Check the "Enable Folder Actions" checkbox, if not checked
    o Click the "+" in the bottom left
    o Select a folder and click Open
    o Choose the script "Run OCR on New Folder Items" and click Attach

  Copyright (C) 2009 Tad Harrison
*)

on adding folder items to this_folder after receiving added_items
    try
        -- Just in case FineReader is running, wait until it is ready
        waitForFineReaderFinish()
        set moreWorkToDo to true
        repeat while moreWorkToDo
            set aFile to getNextFile(this_folder)
            if not aFile = "" then
                log POSIX path of aFile
                ocrFile(aFile)
            else
                set moreWorkToDo to false
            end if
        end repeat
        exitApp("FineReader for ScanSnap")
    on error errorStr number errNum
        display dialog "Error " & errNum & " while running OCR: " & errorStr
    end try
end adding folder items to

(*
  Name: ocrFile
  Description: Runs OCR on the next un-OCR'd file
  Parameters: aFile - the file to be OCR'd
*)
on ocrFile(aFile)
    tell application "FineReader for ScanSnap" to open aFile
    -- Make sure FineReader actually starts before we start waiting for it to stop
    waitForFineReaderStart()
    -- Now wait 'till it's done so we do one file at a time
    waitForFineReaderFinish()
end ocrFile

(*
  Name: appIsRunning
  Description: Determines if a particular application is running.
  Parameters: appName - the name of the application to be tested
  Returns: True if the application is running; otherwise False
*)
on appIsRunning(appName)
    tell application "System Events" to (name of processes) contains appName
end appIsRunning

(*
  Name: exitApp
  Description: Exits the specified app if it is running.
  Parameters: appName - the application name
*)
on exitApp(appName)
    if appIsRunning(appName) then
        tell application appName to quit
    end if
end exitApp

(*
  Name: getNextFile
  Description: Finds the next unprocessed ScanSnap PDF
  Return: the file or ""
*)
on getNextFile(aFolder)
    set masterFileList to list folder aFolder ¬
        without invisibles
    set posixPath to POSIX path of aFolder
    repeat with i from 1 to count masterFileList
        set fileName to item i of masterFileList
        set posixFilePath to posixPath & fileName
        log posixFilePath
        --
        -- Construct a FineReader file name from our file
        --
        set posixBaseName to do shell script ¬
            "filename=" & quoted form of posixFilePath & "; echo ${filename%\\.*}"
        log ("Name: " & posixBaseName)
        set posixOcrFilePath to posixBaseName & " processed by FineReader.pdf"
        --
        -- See if the FineReader file we constructed exists
        --
        tell application "System Events" to set ocrFileExists to exists file posixOcrFilePath
        if ocrFileExists then
            log ("OCR file found for " & posixBaseName)
        end if
        tell me to set fileCreator to getSpotlightInfo for "kMDItemCreator" from posixFilePath
        log ("Creator: " & fileCreator)
        if not ocrFileExists and fileCreator = "ScanSnap Manager" then
            return POSIX file posixFilePath
        end if
    end repeat
    return ""
end getNextFile

(*
  Name: getSpotlightInfo
  Description: Gets a named attribute from metadata for a specific file.
  Parameters: for myattribute - the name of the attribute
              from myfile - the name of the file
  Returns: the attribute value or "" if none found
*)
on getSpotlightInfo for myattribute from myfile
    try
        set this_kMDItemResult to ""
        tell application "Finder"
            set this_item to myfile as string
            set this_item to POSIX path of this_item
            set this_kMDItem to myattribute
            set theResult to words of (do shell script "/usr/bin/mdls -name " & this_kMDItem & " -raw -nullMarker None " & quoted form of this_item)
            log "Result: " & theResult as string
            repeat with j from 1 to number of items in theResult
                set this_kMDItemResult to this_kMDItemResult & item j of theResult as string
                if j < number of items in theResult then
                    set this_kMDItemResult to this_kMDItemResult & " "
                end if
            end repeat
        end tell
    on error
        set this_kMDItemResult to ""
    end try
    return this_kMDItemResult
end getSpotlightInfo

(*
  Name: waitForFineReaderFinish
  Description: Waits until FineReader OCR is complete.
  Returns: True if FineReader OCR is complete; otherwise False

  This procedure constantly loops through open FineReader windows
  looking for the window called "Converting the Document"
  Once that window goes away, the procedure exits.
*)
on waitForFineReaderFinish()
    if not appIsRunning("FineReader for ScanSnap") then
        return false
    end if
    tell application "System Events"
        set window_found to true
        repeat until not window_found
            set ew to name of every window of application process "FineReader for ScanSnap"
            if ew contains "Converting the Document" then
                set window_found to true
                delay 1
            else
                set window_found to false
            end if
        end repeat
    end tell
    return true
end waitForFineReaderFinish

(*
  Name: waitForFineReaderStart
  Description: Waits until FineReader OCR has begun.
  Returns: True if FineReader OCR has started; otherwise False

  This procedure is used to give FineReader a moment to actually
  start chewing on a file. It simply waits for the
  "Converting the Document" window to appear.

  In order to avoid a permanent loop if FineReader doesn't start,
  this times out after 30 seconds.
*)
on waitForFineReaderStart()
    if not appIsRunning("FineReader for ScanSnap") then
        return false
    end if
    with timeout of 30 seconds
        tell application "System Events"
            set window_found to false
            repeat until window_found
                set ew to name of every window of application process "FineReader for ScanSnap"
                if ew contains "Converting the Document" then
                    set window_found to true
                else
                    set window_found to false
                    delay 1
                end if
            end repeat
        end tell
    end timeout
    return true
end waitForFineReaderStart
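The two wait handlers poll once per second. As far as I know, AppleScript's with timeout limits how long a single command sent to an application may take rather than how long a surrounding repeat loop runs, so it is worth testing whether the 30-second guard in waitForFineReaderStart really ends the loop. One way to make such a limit explicit is to count polling attempts. The sketch below shows that idea in shell rather than AppleScript; wait_until is a hypothetical helper, not part of the script above.

```shell
#!/bin/sh
# wait_until TRIES INTERVAL COMMAND [ARGS...]
# Hypothetical helper: run COMMAND up to TRIES times, sleeping INTERVAL
# seconds after each failed attempt. Returns 0 as soon as COMMAND
# succeeds, or 1 once the attempts are used up.
wait_until() {
    tries=$1
    interval=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$((tries - 1))
        sleep "$interval"
    done
    return 1
}
```

For example, wait_until 30 1 test -e "Scan001 processed by FineReader.pdf" would give up after roughly 30 seconds if the OCR output never appears.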
Installation
- Use the Script Editor to save this script as Run OCR on New Folder Items under User Home/Library/Scripts/Folder Action Scripts. You may have to create the Folder Action Scripts folder.
- Now open a Finder window, control-click and choose More / Configure Folder Actions…
- Check the Enable Folder Actions checkbox, if not checked
- Click the “+” in the bottom left
- Select a folder and click Open
- Choose the script Run OCR on New Folder Items and click Attach
Picky Details
As you can see in the source code, there were several issues to address:
- I had to make sure the script didn’t step on itself. If FineReader was running, I would wait until it was ready before processing.
- The script needed to determine which files had been processed already. This was handled fairly trivially by looking for a matching file with the “ processed by FineReader.pdf” suffix. In other words, if I was looking at Scan001.pdf, I would see if there was a matching “Scan001 processed by FineReader.pdf” file.
- Part of checking for a source file’s “buddy” was stripping off the PDF suffix. This was done in a hackish way by using a one-line shell script, the do shell script call in getNextFile (lines 106-107 of the original listing).
- I thought it was important to verify that the source file was, indeed, a ScanSnap file, because FineReader for ScanSnap will not process other PDF documents. This was done in getNextFile (lines 117-121) by looking at the Spotlight metadata for the Creator of the source file. That took some more shell scripting, in getSpotlightInfo (lines 133-154).
- The actual work was done by a single line, the open command in ocrFile (line 63).
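The two shell-based details can be tried on their own in Terminal. This is a minimal sketch, assuming a POSIX shell: ocr_name, creator_of and needs_ocr are hypothetical names, while the mdls flags are the ones the script already uses. The creator check only runs on a Mac, since mdls is a macOS tool.

```shell
#!/bin/sh
# Derive the name FineReader gives its output for a scanned file.
# ${1%.*} removes the shortest trailing ".something" from the path.
ocr_name() {
    printf '%s processed by FineReader.pdf\n' "${1%.*}"
}

# Read the Spotlight creator of a file (macOS only).
creator_of() {
    /usr/bin/mdls -name kMDItemCreator -raw -nullMarker None "$1"
}

# Succeed when the file has no OCR companion yet and was produced by
# ScanSnap Manager, the same two conditions getNextFile tests.
needs_ocr() {
    [ ! -e "$(ocr_name "$1")" ] && [ "$(creator_of "$1")" = "ScanSnap Manager" ]
}
```

With these definitions, ocr_name "Scan001.pdf" prints "Scan001 processed by FineReader.pdf". Note that the expansion strips whatever follows the last dot, so it assumes the file actually has an extension.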
The real work was fairly simple, while the bulk of the code was needed to polish pesky little details. Isn’t that the way code development often is?
If anyone has any improvements on my script, please let me know!



No. 1 — August 29th, 2009 at 11:10 pm
Great idea– my Fine Reader is acting up right now, but this is an inspired script. Fujitsu should buy it from you!
No. 2 — August 31st, 2009 at 10:54 am
Hey, great. This gives me hope that there is a way to really automate the just-purchased copy of Abbyy FineReader Express. Looking at the script, it shouldn’t be too difficult to recursively go through a folder tree and convert, say all *.tif files to searchable PDF.
The only question is: will FR Express allow that?! When I purchased the product, it was partially because of the word “Automated”.
I have the ScanSnap scanner and the FR product that came with that also, and that makes me wonder if I over-purchased…
Do you have any experience with the FR Express product?
No. 3 — September 28th, 2009 at 3:01 am
Dear Tad,
I have been reading your blog posts and came across this one on OCR. I thought you might be interested to check out OCR Terminal as well.
This is an online OCR service that we provide. We have some neat features, such as
· 20 free page credits per month
· Supports many input formats, such as PDF, JPEG, TIFF, BMP, PNG and GIF
· Results can be downloaded as DOC, searchable PDF, RTF and plain text
· API to help developers integrate OCR functionality in their web apps
Besides, OCR Terminal recently came out of beta, so take a minute to try us out! We would love to hear what you think about it.
Thanks!
Xuwen (Ms)
OCR Terminal
http://www.ocrterminal.com
No. 4 — November 11th, 2009 at 3:18 pm
Tad,
That’s a great idea and makes using the scanner so much more useful. It’s a shame that one cannot run multiple instances of FR in the first place or that there isn’t a built-in queue. While running your script, I still get a message from FR after scanning more than one document while OCR is underway. It says “Can’t open files while the recognition is under way”. Any idea how to fix this? Right now, one needs to press the OK button to proceed; if one doesn’t, FR stops at the “Saving the resulting file ….” step. Everything else works very well.
Thanks,
Carsten
No. 5 — November 11th, 2009 at 11:13 pm
Hi Carsten,
I agree about the FineReader limitations. The FR engine is an excellent piece of software, but they have purposefully made FineReader for ScanSnap to be a limited-feature product. So, we hack together scripts to get around some of these limitations.
I have run into the problem you describe in one situation: when I am running FineReader in one Space and I change to a different Leopard Spaces window. My scripting skillz just weren’t up to the task of figuring out if the app was showing a window in a different Leopard Spaces window.
Unless someone has a good fix for this, my own solution is to never switch to another space while processing is happening (though it would certainly be convenient to do so).
Now about FineReader’s hobbled functionality…
Ordinarily, I wouldn’t mind this so much, since it is given away with the scanner, but vendors typically provide limited versions like this when they want you to upgrade to their fancier product. Sadly, I just don’t get the feeling that Abbyy is really behind their full price Mac version of FineReader.
They offer no trial version, and there are sparse details about the scripting and batch processing capabilities of the product. As Hans did above, you simply have to trust that it is good and drop a hundred bucks on it.
If Abbyy were to make a better product for OS X, I would gladly pay a few hundred for it.
The shame is, the engine can do all of this and more. We are using a Linux-compiled version of the FineReader engine SDK in my workplace, with shell scripts and such, to index great loads of scientific papers. It runs happily on multiple processors, chewing up thousands of pages per day.
FineReader for the Macintosh is a powerful product, weakened by ho-hum client software.
No. 6 — November 11th, 2009 at 11:53 pm
Tad,
Thanks for the quick response. I checked your idea about maybe Spaces getting in the way by disabling that feature completely. I still get the same warning message from FR. Looking at your script, I don’t quite understand why that would be, since you are testing for an already existing conversion process before launching the next one. Oh well, AppleScript’s never been my forte.
I also think that FR is a pretty decent piece of software after having played with some of the other alternatives like Readiris Pro for Mac. I like the simplicity of FR and the consistently accurate results. Abbyy today released a new version of their FR Express for Mac, but there’s no trial version, no upgrade path, and no mention of whether queuing is supported. The only thing that is being mentioned is that one can open multiple documents for processing and merging; a somewhat vague statement.
I imagine that the script you wrote is being called every time a new document is being added to the scanned-documents folder. Does this mean that multiple instances of the script are running at the same time? If so, are they not getting in each other’s way? I guess one could add new documents to a queue file using either launchd or Folder Actions and have another script work through this list. I might have a whack at it … or at least try to completely understand your elaborate script;-)
Thanks,
Carsten
No. 8 — January 11th, 2010 at 1:18 am
Dear Tad – Thanks for the excellent script. I am working with a group of people with dyslexia, trying to make access to the printed word easier. I have got them the latest ScanSnap S1500M (it seems to work fine out of the box with Snow Leopard, v3.0 L10), and in setting up your script I realised that it has a setting, [ ] Convert to searchable PDF. So now all we have to do is put the letter in the scanner > press the blue button > go to the scan folder > open the PDF > select all text and then press [F5] to make the Mac read the text. I am looking for someone who would be willing to automate this final part (AppleScript?) so that if [F5] is pressed after an OCR scan is done, it simply reads the letter. I am happy to pay for the script – can you help?
Best wishes Joe (joe AT venturacottage.com)