
Re: Using Hooks To OCR Documents

From: Ryan Schmidt <subversion-2010d_at_ryandesign.com>
Date: Mon, 6 Dec 2010 02:02:22 -0600

On Dec 3, 2010, at 09:44, Jim Jenkins wrote:

> I’m planning to use Hooks to add OCR scanning for select documents going into an SVN repo. I’m not really sure where to start so I’m hoping someone here can tell me if it’s possible and even suggest how best to proceed.
>
> Basically I’d like to have every commit to an SVN repo stop at the pre-commit (or another more suitable) hook so the submitted files can be inspected and if needed run through a command line OCR engine. We are dealing with “image” based PDF files so these would be sent off to the OCR engine and a “text+image” PDF would be returned. The new PDF would replace the original before being sent on its way into the SVN repo.

Some of this is possible, assuming that you will automate everything, including the process of deciding whether or not to OCR the document. (Hook scripts run on the server and are not interactive.)

Here's an example pre-commit hook which checks the syntax of any committed Java files:

http://svn.haxx.se/users/archive-2006-06/0853.shtml

You could change the criteria from "extension .java" to whatever your criteria are ("extension .pdf", maybe, and then some other check to see if the PDF is image-based), and change the action from running checkstyle to running your OCR program.
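
To make the "extension .pdf" criterion concrete, here is a minimal Python sketch of how a pre-commit hook might pick out the PDFs in a pending transaction. It is not taken from the linked script. It assumes the hook is written in Python and that "svnlook" is on the server's PATH; the function names are made up for illustration.

```python
import subprocess


def svnlook_changed(repos, txn):
    """Return the raw output of `svnlook changed` for a pending transaction."""
    result = subprocess.run(
        ["svnlook", "changed", "-t", txn, repos],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def added_or_modified_pdfs(changed_output):
    """Pick the paths of added or content-modified PDFs out of that output."""
    paths = []
    for line in changed_output.splitlines():
        # The first two columns hold status flags; the path starts at column 5.
        action, path = line[:1], line[4:]
        if action in ("A", "U") and path.lower().endswith(".pdf"):
            paths.append(path)
    return paths
```

Deleted files and property-only changes are skipped here, since there is no new content to inspect in either case.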

What's not possible is changing the content of the incoming transaction, as you propose. Your pre-commit hook script must either accept the transaction as-is (by exiting with status 0) or reject it (by exiting with any nonzero status). So you could do that: if an incoming PDF is image-based, reject the commit and inform the user they must run the OCR program on it first. Anything the hook writes to stderr is passed back to the committing client when the commit is rejected.
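
As a rough sketch of that accept-or-reject decision in Python: the code below reads each PDF out of the transaction with "svnlook cat" and produces an exit status plus a message for the user. The test for "image-based" is purely an assumption for illustration (it treats a PDF with no "/Font" marker in its raw bytes as image-only, which will misjudge some files, for example those using compressed object streams); a real hook would need a more reliable check. The function names are hypothetical.

```python
import subprocess


def read_txn_file(repos, txn, path):
    """Fetch a file's bytes from the pending transaction via `svnlook cat`."""
    return subprocess.run(
        ["svnlook", "cat", "-t", txn, repos, path],
        capture_output=True, check=True,
    ).stdout


def looks_image_only(pdf_bytes):
    """Crude, assumed heuristic: no font resources means no text layer."""
    return b"/Font" not in pdf_bytes


def verdict(pdfs):
    """Map {path: bytes} to (exit_status, message_for_stderr)."""
    offenders = sorted(p for p, data in pdfs.items() if looks_image_only(data))
    if not offenders:
        return 0, ""  # status 0: accept the transaction as-is
    lines = [
        "Commit rejected: these PDFs appear to be image-only.",
        "Run your OCR program on them and commit again:",
    ]
    lines += ["  " + p for p in offenders]
    return 1, "\n".join(lines) + "\n"  # nonzero status: reject
```

The hook itself would build the dictionary with read_txn_file for each committed PDF, write the message to sys.stderr, and call sys.exit with the status.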

I have a pre-commit script on my repository doing something similar: I run pngcrush on committed PNGs, and if I find a PNG that would benefit from being crushed, I reject the commit and tell the user to pngcrush it and then try the commit again.

That would be the preferred way to do things. But if it will be too difficult for your users to run the OCR program themselves and you want to automate the process server-side, an alternative is to accept the commit -- not run any of these checks in the pre-commit -- and run your script at post-commit time instead. If you detect that a just-committed revision contains an image-based PDF that you can OCR, then OCR it, and replace it, in a second commit initiated by the post-commit script.

This is trickier because the hook script might then have to manage a working copy (check out the directory, change the PDF to the OCR'd one, commit, delete the working copy). This is fraught with problems such as: What happens if the post-commit script decides to act on the PDF that's being committed by the post-commit script? (Infinite loop?) What happens if someone manages to commit another revision to that PDF before the hook script is done committing its revision? Perhaps that's not likely. But commits can fail for many reasons, which the script would either have to anticipate and deal with, or log or email failures for someone to deal with manually.

There's also the problem that a user who committed an image-based PDF would then immediately have an out-of-date working copy, which is not expected in normal Subversion usage, though you could train your users to understand this and recommend they run "svn up" again shortly after committing. Or, if your script does replace a PDF, you could inform the user via out-of-band means (email, instant message, etc.) that they should run "svn up".
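
One possible way to avoid the infinite loop, sketched in Python under some assumptions: the post-commit script commits under a dedicated account (the name "ocr-bot" is made up) and ignores any revision whose author, as reported by "svnlook author -r REV REPOS", is that account. The second function only lists the commands the working-copy dance might involve; "ocr-tool" is a placeholder for whatever OCR program you use, and none of the failure handling discussed above is shown.

```python
import posixpath

BOT_USER = "ocr-bot"  # hypothetical account used only by the hook


def should_process(author):
    """Skip revisions the hook committed itself, so it never re-triggers."""
    return author.strip() != BOT_USER


def replacement_steps(repo_url, pdf_path, workdir):
    """List the commands for replacing one PDF in a follow-up commit."""
    parent, name = posixpath.split(pdf_path)
    dir_url = posixpath.join(repo_url, parent) if parent else repo_url
    local = posixpath.join(workdir, name)
    return [
        # Check out only the files directly inside the PDF's directory.
        ["svn", "checkout", "--depth", "files", dir_url, workdir],
        # Placeholder OCR invocation: input file, then output file.
        ["ocr-tool", local, local],
        ["svn", "commit", "--username", BOT_USER,
         "-m", "Replace scanned PDF with OCR'd version", local],
    ]
```

The script would run each step in turn, stop and report at the first failure, and delete the working copy afterwards in either case.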
Received on 2010-12-06 09:03:07 CET

This is an archived mail posted to the Subversion Users mailing list.