Automate ScanSnap OCR process on your Mac with AppleScript

Some months back I wrote an article on using scripting languages to glue workflows together. My inspiration for that article was a bit of AppleScript that I had struggled with in order to smooth over a minor annoyance in my scan-to-OCR workflow.

I had promised that once I cleaned up the embarrassing bits of code I would post a perfect polished version here, but such promises are rarely fulfilled. A reader posted a comment asking for that source code, so I will post it here in its current state. The truth is, I have been using this script for months and, though it has some quirks, it works fine.

So this post is about Macintosh, AppleScript, and the ScanSnap-to-FineReader workflow. If these don’t interest you, better move on.

Update: The script on this page works only with Leopard (10.5). Get the Snow Leopard version here

The Original Problem

The Fujitsu ScanSnap S510m, my workhorse scanner, was designed to scan documents quickly and generate PDF files, and this it does flawlessly. To provide OCR support, Fujitsu ships a special version of Abbyy FineReader, called FineReader for ScanSnap. The standard OCR configuration is to chain the output of the scanner to the FineReader program.

The problem is that this forces scanning and OCR to run in lockstep: you scan a document, you wait for OCR, and then you scan another document.

I wanted to write a simple AppleScript that would detach the “Scan a Document” process from the “OCR” process. By using this script, I can scan documents at whatever rate pleases me, and the OCR engine will chug along at its own pace, consuming my scanned documents and producing OCR documents.

My Approach

I really looked hard at the OCR application, trying to find AppleScript hooks or special command line switches that might allow me to control it better. Sadly, it was not designed to be scriptable. The only thing I could do was call the FineReader application with a source file.

Given this limitation, I considered writing a script that would look at a particular folder, identifying new files as they appear and passing them on to FineReader.

Fortunately, AppleScript provides this kind of functionality with little effort in the form of Folder Actions. Perhaps the best way to see these in action (and try it out) is to see this post on Exploring the power of Folder Actions.

In order to achieve my goals, I did the following:

  • Created a folder called “Pending Documents”
  • Wrote the script to find the oldest unprocessed file and call FineReader with it
  • Attached the script to the folder as a Folder Action
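
To make the second step concrete, here is a minimal shell sketch of the "find the next unprocessed file" idea. It is not the AppleScript itself, only the same logic in a form that is easy to try in a terminal. The function name next_pending_pdf is made up for this illustration, and it assumes that an OCR result always sits next to its source with the " processed by FineReader.pdf" suffix. Files come back in glob (alphabetical) order, which matches scan order only if the file names sort that way (Scan001, Scan002, and so on).

```shell
#!/bin/sh
# Hypothetical helper: print the first PDF in a folder that has no
# "<name> processed by FineReader.pdf" companion. Returns non-zero when
# every PDF already has one (or the folder holds no PDFs at all).
OCR_SUFFIX=" processed by FineReader.pdf"

next_pending_pdf() {
    folder=$1
    for pdf in "$folder"/*.pdf; do
        [ -e "$pdf" ] || continue            # the glob matched nothing
        case "$pdf" in
            *"$OCR_SUFFIX") continue ;;      # this file is itself an OCR result
        esac
        base=${pdf%.*}                       # drop the trailing ".pdf"
        if [ ! -e "$base$OCR_SUFFIX" ]; then
            printf '%s\n' "$pdf"             # first file without a companion
            return 0
        fi
    done
    return 1
}

# Usage (the folder name comes from the list above):
#   next_pending_pdf "$HOME/Pending Documents"
```

The Folder Action only has to call something like this in a loop until nothing is left, which is why the queue survives however quickly new scans arrive.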

The Script

Let’s jump right into the AppleScript. Download the script here.

(*
This is a folder listener script that will act as a queue, receiving
PDF files from the ScanSnap scanner and feeding them, one by one, to
the Abbyy FineReader OCR software.

This allows you to keep scanning while the OCR job runs in the background
on all of the unprocessed files.

Why do we want to do this?

The ScanSnap Manager software does not support this by default, so
when you scan in a file, it sends it to FineReader for OCR. You then
must wait until FineReader finishes its work before scanning in another
document.

This script allows you to keep scanning without waiting for OCR.

Installation:

o   Copy this script to:

    <home>/Library/Scripts/Folder Action Scripts

    You may have to create the "Folder Action Scripts" folder.

o   Now open a Finder window, control-click and choose:

    More / Configure Folder Actions...

o   Check the "Enable Folder Actions" checkbox, if not checked
o   Click the "+" in the bottom left
o   Select a folder and click Open
o   Choose the script "Run OCR on New Folder Items" and click Attach

Copyright (C) 2009 Tad Harrison
*)

on adding folder items to this_folder after receiving added_items
    try
        -- Just in case FineReader is running, wait until it is ready
        waitForFineReaderFinish()
        set moreWorkToDo to true
        repeat while moreWorkToDo
            set aFile to getNextFile(this_folder)
            if not aFile = "" then
                log POSIX path of aFile
                ocrFile(aFile)
            else
                set moreWorkToDo to false
            end if
        end repeat
        exitApp("FineReader for ScanSnap")
    on error errorStr number errNum
        display dialog "Error " & errNum & " while running OCR: " & errorStr
    end try
end adding folder items to
(*
Name: ocrFile
Description: Runs OCR on the next un-OCR'd file
Parameters:
  aFile - the file to be OCR'd
*)

on ocrFile(aFile)
    tell application "FineReader for ScanSnap" to open aFile
    -- Make sure FineReader actually starts before we start waiting for it to stop
    waitForFineReaderStart()
    -- Now wait 'till it's done so we do one file at a time
    waitForFineReaderFinish()
end ocrFile
(*
Name: appIsRunning
Description: Determines if a particular application is running.
Parameters:
    appName - the name of the application to be tested
Returns: True if the application is running; otherwise False
*)

on appIsRunning(appName)
    tell application "System Events" to (name of processes) contains appName
end appIsRunning
(*
Name: exitApp
Description: Exits the specified app if it is running.
Parameters:
    appName - the application name
*)

on exitApp(appName)
    if appIsRunning(appName) then
        tell application appName to quit
    end if
end exitApp
(*
Name: getNextFile
Description: Finds the next unprocessed ScanSnap PDF
Return: the file or ""
*)

on getNextFile(aFolder)
    set masterFileList to list folder aFolder ¬
        without invisibles
    set posixPath to POSIX path of aFolder
    repeat with i from 1 to count masterFileList
        set fileName to item i of masterFileList
        set posixFilePath to posixPath & fileName
        log posixFilePath
        --
        -- Construct a FineReader file name from our file
        --
        set posixBaseName to do shell script ¬
            "filename=" & quoted form of posixFilePath & "; echo ${filename%\\.*}"
        log ("Name: " & posixBaseName)
        set posixOcrFilePath to posixBaseName & " processed by FineReader.pdf"
        --
        -- See if the FineReader file we constructed exists
        --
        tell application "System Events" to set ocrFileExists to exists file posixOcrFilePath
        if ocrFileExists then
            log ("OCR file found for " & posixBaseName)
        end if
        tell me to set fileCreator to getSpotlightInfo for "kMDItemCreator" from posixFilePath
        log ("Creator: " & fileCreator)
        if not ocrFileExists and fileCreator = "ScanSnap Manager" then
            return POSIX file posixFilePath
        end if
    end repeat
    return ""
end getNextFile
(*
Name: getSpotlightInfo
Description: Gets a named attribute from metadata for a specific file.
Parameters:
    for myattribute - the name of the attribute
    from myfile - the name of the file
Returns: the attribute value or "" if none found
*)

on getSpotlightInfo for myattribute from myfile
    try
        set this_kMDItemResult to ""
       
        tell application "Finder"
            set this_item to myfile as string
            set this_item to POSIX path of this_item
            set this_kMDItem to myattribute
            set theResult to words of (do shell script "/usr/bin/mdls -name " & this_kMDItem & " -raw -nullMarker None " & quoted form of this_item)
            log "Result: " & theResult as string
            repeat with j from 1 to number of items in theResult
                set this_kMDItemResult to this_kMDItemResult & item j of theResult as string
                if j < number of items in theResult then
                    set this_kMDItemResult to this_kMDItemResult & " "
                end if
            end repeat
        end tell
    on error
        set this_kMDItemResult to ""
    end try
    return this_kMDItemResult
end getSpotlightInfo
(*
Name: waitForFineReaderFinish
Description: Waits until FineReader OCR is complete.
Returns: True if FineReader OCR is complete; otherwise False

This procedure constantly loops through open FineReader windows looking
for the window called "Converting the Document"
Once that window goes away, the procedure exits.
*)

on waitForFineReaderFinish()
    if not appIsRunning("FineReader for ScanSnap") then
        return false
    end if
    tell application "System Events"
        set window_found to true
        repeat until not window_found
            set ew to name of every window of application process "FineReader for ScanSnap"
            if ew contains "Converting the Document" then
                set window_found to true
                delay 1
            else
                set window_found to false
            end if
        end repeat
    end tell
    return true
end waitForFineReaderFinish
(*
Name: waitForFineReaderStart
Description: Waits until FineReader OCR has begun.
Returns: True if FineReader OCR has started; otherwise False

This procedure is used to give FineReader a moment to actually start
chewing on a file. It simply waits for the "Converting the Document"
window to appear.
In order to avoid a permanent loop if FineReader doesn't
start, this times out after 30 seconds.
*)

on waitForFineReaderStart()
    if not appIsRunning("FineReader for ScanSnap") then
        return false
    end if
    -- "with timeout" only limits how long a single Apple event may take;
    -- it does not bound a repeat loop, so count the seconds ourselves.
    repeat 30 times
        tell application "System Events"
            set ew to name of every window of application process "FineReader for ScanSnap"
        end tell
        if ew contains "Converting the Document" then
            return true
        end if
        delay 1
    end repeat
    -- The window never appeared within roughly 30 seconds
    return false
end waitForFineReaderStart

Installation

  • Use the Script Editor to save this script as Run OCR on New Folder Items under User Home/Library/Scripts/Folder Action Scripts. You may have to create the Folder Action Scripts folder.
  • Now open a Finder window, control-click and choose More / Configure Folder Actions…
  • Check the Enable Folder Actions checkbox, if not checked
  • Click the “+” in the bottom left
  • Select a folder and click Open
  • Choose the script Run OCR on New Folder Items and click Attach

Picky Details

As you can see in the source code, there were several issues to address:

  • I had to make sure the script didn’t step on itself. If FineReader was running, I would wait until it was ready before processing.
  • The script needed to determine which files had been processed already. This was handled fairly trivially by looking for a matching file with the processed by FineReader.pdf suffix. In other words, if I was looking at Scan001.pdf, I would see if there was a matching Scan001 processed by FineReader.pdf file.
  • Part of checking for a source file’s “buddy” was stripping off the PDF suffix. This was done in a hackish way with a one-line shell script (the do shell script call in getNextFile, lines 106-107 of the script).
  • I thought it was important to verify that the source file was, indeed, a ScanSnap file, because FineReader for ScanSnap will not process other PDF documents. This was done at lines 117-121 of getNextFile by looking at the Spotlight metadata for the Creator of the source file. That took some more shell scripting (the getSpotlightInfo handler, lines 133-154).
  • The actual work was done by a single line, line 63: the tell application "FineReader for ScanSnap" to open aFile statement in ocrFile.
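
The two shell tricks mentioned in the list can be tried on their own. The sketch below pulls them out of the AppleScript as plain shell functions. The function names and the MDLS override variable are hypothetical, added only so the Spotlight check can be exercised on a machine without Spotlight. The mdls flags are the ones the script already uses, and the sketch assumes mdls prints the bare attribute value when given -raw.

```shell
#!/bin/sh
# Strip the last extension with parameter expansion: "%.*" removes the
# shortest trailing match of ".*", so only the final ".pdf" goes away.
# Like the one-liner in the script, it expects the file name to contain a dot.
strip_extension() {
    printf '%s\n' "${1%.*}"
}

# Ask Spotlight which application created a file.
# MDLS is a hypothetical override; by default the real /usr/bin/mdls is used.
file_creator() {
    "${MDLS:-/usr/bin/mdls}" -name kMDItemCreator -raw -nullMarker None "$1"
}

# True when the creator metadata says the file came from ScanSnap Manager.
is_scansnap_pdf() {
    [ "$(file_creator "$1")" = "ScanSnap Manager" ]
}

# Usage:
#   strip_extension "/Users/me/Pending Documents/Scan001.pdf"
#   is_scansnap_pdf "/Users/me/Pending Documents/Scan001.pdf" && echo "scan"
```

Doing the comparison in the shell avoids splitting the mdls output into words and gluing it back together, though the AppleScript version works fine as it stands.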

The real work was fairly simple, while the bulk of the code was needed to polish pesky little details. Isn’t that the way code development often is?

If anyone has any improvements on my script, please let me know!

8 Responses to “Automate ScanSnap OCR process on your Mac with AppleScript”

  1. Ari writes:

    Great idea– my Fine Reader is acting up right now, but this is an inspired script. Fujitsu should buy it from you!

  2. Hans Baumeister writes:

    Hey, great. This gives me hope that there is a way to really automate the just-purchased copy of Abbyy FineReader Express. Looking at the script, it shouldn’t be too difficult to recursively go through a folder tree and convert, say all *.tif files to searchable PDF.
    The only question is: will FR Express allow that?! When I purchased the product, it was partially because of the word “Automated”.
    I have the ScanSnap scanner and the FR product that came with that also, and that makes me wonder if I over-purchased…
    Do you have any experience with the FR Express product?

  3. Xuwen writes:

    Dear Tad,

    I have been reading your blog posts and came across this one on OCR. I thought you might be interested to check out OCR Terminal as well.
    This is an online OCR service that we provide. We have some neat features, such as

    · 20 free page credits per month

    · Supports many input formats, such as PDF, JPEG, TIFF, BMP, PNG and GIF

    · Results can be downloaded as DOC, searchable PDF, RTF and plain text

    · API to help developers integrate OCR functionality in their web apps

    Besides OCR Terminal recently came out of Beta. So take a minute to try us out! We love to hear what you think about it.

    Thanks!

    Xuwen (Ms)
    OCR Terminal
    http://www.ocrterminal.com

  4. Carsten writes:

    Tad,

    That’s a great idea and makes using the scanner so much more useful. It’s a shame that one cannot run multiple instances of FR in the first place or that there isn’t a built-in queue. While running your script, I still get a message from FR after scanning more than one document while OCR is underway. It says “Can’t open files while the recognition is under way”. Any idea how to fix this? Right now, one needs to press the OK button to proceed; if one doesn’t, FR stops at the “Saving the resulting file ….” step. Everything else works very well.

    Thanks,
    Carsten

  5. Tad writes:

    Hi Carsten,

    I agree about the FineReader limitations. The FR engine is an excellent piece of software, but they have purposefully made FineReader for ScanSnap to be a limited-feature product. So, we hack together scripts to get around some of these limitations.

    I have run into the problem you describe in one situation: when I am running FineReader in one Space and I change to a different Leopard Spaces window. My scripting skillz just weren’t up to the task of figuring out if the app was showing a window in a different Leopard Spaces window.

    Unless someone has a good fix for this, my own solution is to never switch to another space while processing is happening (though it would certainly be convenient to do so).

    Now about FineReader’s hobbled functionality…

    Ordinarily, I wouldn’t mind this so much, since it is given away with the scanner, but vendors typically provide limited versions like this when they want you to upgrade to their fancier product. Sadly, I just don’t get the feeling that Abbyy is really behind their full price Mac version of FineReader.

    They offer no trial version, and there are sparse details about the scripting and batch processing capabilities of the product. As Hans did above, you simply have to trust that it is good and drop a hundred bucks on it.

    If Abbyy were to make a better product for OS X, I would gladly pay a few hundred for it.

    The shame is, the engine can do all of this and more. We are using a Linux-compiled version of the FineReader engine SDK in my workplace, with shell scripts and such, to index great loads of scientific papers. It runs happily on multiple processors, chewing up thousands of pages per day.

    FineReader for the Macintosh is a powerful product, weakened by ho-hum client software.

  6. Carsten writes:

    Tad,

    Thanks for the quick response. I checked your idea about maybe Spaces getting in the way by disabling that feature completely. I still get the same warning message from FR. Looking at your script, I don’t quite understand why that would be, since you are testing for an already existing conversion process before launching the next one. Oh well, AppleScript’s never been my forte.

    I also think that FR is a pretty decent piece of software after having played with some of the other alternatives like Readiris Pro for Mac. I like the simplicity of FR and the consistently accurate results. Abbyy today released a new version of their FR Express for Mac, but there’s no trial version, no upgrade path, and no mention of whether queuing is supported. The only thing that is being mentioned is that one can open multiple documents for processing and merging; a somewhat vague statement.

    I imagine that the script you wrote is being called every time a new document is being added to the scanned-documents folder. Does this mean that multiple instances of the script are running at the same time? If so, are they not getting in each other’s way? I guess one could add new documents to a queue file using either launchd or Folder Actions and have another script work through this list. I might have a whack at it … or at least try to completely understand your elaborate script;-)

    Thanks,
    Carsten

  7. Automate ScanSnap OCR process on your Mac with AppleScript (Snow Leopard Edition) | Paper Jammed writes:

    [...] time back I published an AppleScript that allows one to automatically run OCR in the background on scanned files generated by your Fujitsu ScanSnap, while you to continue scanning more files. ScanSnap owners [...]

  8. Joe Thompson writes:

    Dear Tad – Thanks for the excellent script. I am working with a group of people with dyslexia, trying to make access to the printed word easier. I have got them the latest scansnap s1500m (seems to work fine out of the box with snow leopard (v3.0 L10) and in setting up your script I realised that it has a setting [ ] Convert to searchable pdf. So now all we have to do is put the letter in the scanner > press blue button > go to scan folder > open pdf > select all text and then press [F5] to make the mac read the text. I am looking for someone who would be willing to automate this final part (apple script?) so that if [f5] is pressed after an oCR scan is done that it simply reads the letter. I am happy to pay for the script – can you help?
    Best wishes Joe (joe AT venturacottage.com)
