PDF-OCR: Sorting documents into searchable PDFs

I’ve gotten rid of paper at home by installing an automatic scanner/OCR/document sorting system based on an all-in-one printer-scanner and a Raspberry Pi.

For years I’ve been struggling to keep up with bureaucracy. I do really dislike everything to do with official papers. In most years that meant that I would just briefly read official letters and documents before putting them in a box. That summarizes my sorting system pretty well. Come the end of the year I would take a day or two to sort them into folders by category. I’ll never understand why, at least for those letters, we have not yet gone digital. In Germany the laws would have permitted that for more than 10 years.

Since I started my PhD I have been making an effort to be more careful about my bureaucracy. I began to use the printer/scanner combination at work to archive a digital version of the most important documents in order to be able to find them quickly. But most documents still lived at home in a box. The main reason: I was reluctant to bring a box of personal documents to work and scan them; even off-hours it seemed inappropriate.

Then my father told me that his all-in-one printer supposedly does OCR (optical character recognition) on documents he scans (unlike the machine at the office). OCR means that your PDF is not made up only of images, as it is with most scanners. The computer also reconstructs the text from the images, which allows you to search through the PDF and jump to a page just as you can in PDFs that you generate from, say, Word documents. Searchable PDFs of course have the important quality of being searchable. In theory you don’t have to sort them at all.

In practice you may want some presorting, say by the company that sent you a letter. But that is something you can do easily once you have searchable PDFs.

When I bought the printer/scanner I made sure that it can scan to a network drive without having a computer attached. This way it can deposit scanned documents directly on the hard disk of our network-attached storage system (we’ve got a Synology DS213, but really any NAS would be fine).

For processing the scans I thought I’d use the other computer constantly running in our home, a Raspberry Pi whose tasks so far include logging the temperature in different rooms and remote control of power outlets. At first I thought I’d have to do everything myself, but I soon found that somebody had already done the work: pypdfocr, a great Python program by Virantha Ekanayake, takes multipage image-only PDFs as input, disassembles them, runs them through the open source tesseract OCR engine and puts them back together as conveniently searchable PDFs. Then it puts the PDFs into folders depending on configurable keywords (think “Invoice”, “Insurance”, “Tax”).

More than that, it can conveniently be installed from the Python Package Index (PyPI) using the command:

```
pip install pypdfocr
```

The first time I ran it on the Raspberry Pi, though, the output was unfortunately not searchable PDFs. In my case the reason was that the tesseract version in the Raspberry Pi package repositories was either outdated or a modified version. The fix in my case was downloading tesseract from Google (they develop it) at Google Code and compiling it myself. The necessary steps are:

download tesseract and unpack it on the Raspberry Pi

run the setup and compilation from the main folder of the source:
```
./configure
make && sudo make install
```
If your Raspberry Pi complains during the configuration/compilation that software or libraries are missing, install them using the package manager. Note that you might need the -dev version of the libraries, since those contain the header files the compiler is looking for.
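After the build it can be worth checking which tools the shell actually finds, since an older packaged tesseract may still be installed next to the self-compiled one. The following is a minimal sketch: the list of tools (tesseract, gs for Ghostscript, convert for ImageMagick, pypdfocr) is an assumption about what this setup needs, and report_tool is just a helper name made up for this example.

```shell
#!/bin/sh
# Print where a command is found, or a warning if it is missing.
# report_tool is a hypothetical helper name, not part of pypdfocr or tesseract.
report_tool() {
  if location=$(command -v "$1" 2>/dev/null); then
    echo "$1: $location"
  else
    echo "$1: NOT FOUND"
    return 1
  fi
}

# Assumed tool list; adjust it to your system.
for tool in tesseract gs convert pypdfocr; do
  report_tool "$tool" || true
done

# With the default ./configure settings, "make install" puts tesseract
# under /usr/local. Print the version of whichever one is found first.
if command -v tesseract >/dev/null 2>&1; then
  tesseract --version
fi
```

If the path printed for tesseract is not the one you just installed, the packaged version is probably still earlier in your PATH.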

On top of that I wanted to make sure that pypdfocr automatically processes all PDFs that go into the incoming folder on the NAS. To do that I did the following:

I mounted the documents directory from the NAS on /nfs/documents.

I instructed the scanner to scan into the documents directory, subfolder paul.

I wrote a short shell script that regularly checks for new PDFs and, if it finds any, runs pypdfocr on them:

```
#!/bin/sh
while sleep 60 # check once a minute
do
  for i in /nfs/documents/paul/*.pdf # for all PDFs in the folder
  do
    [ -e "$i" ] || continue # skip the unexpanded pattern when there are no PDFs
    echo "found file $i" # output name
    sleep 20 # wait for 20 seconds to make sure the file is completely written
    pypdfocr "$i" -f -v -c config.yaml # run pypdfocr
  done
done
```

In theory, pypdfocr can do the last step itself (heck, it can even upload the stuff to Evernote if you’re into that). However, depending on your system of network shares, you cannot always be sure that a file is correctly locked. Then it can happen that you start the conversion process on a file that is still being written by the scanner. In my case the scanner usually takes less than 10 seconds to write a file. Therefore I wait for 20 seconds after I notice a new file. This way I am sure the file is fully on the NAS before starting the conversion.
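If a fixed waiting time feels too fragile, one alternative is to wait until the file size has stopped changing. This is only a sketch of that idea, not something pypdfocr provides: wait_until_stable is a made-up helper name, and the check assumes that the scanner never pauses for longer than the chosen interval while writing.

```shell
#!/bin/sh
# Return once a file's size is the same on two consecutive checks.
# wait_until_stable is a hypothetical helper; the interval is in seconds.
wait_until_stable() {
  file=$1
  interval=${2:-5}
  previous=""
  while :; do
    # Give up if the file does not exist (or has disappeared meanwhile).
    [ -f "$file" ] || return 1
    # wc -c prints the size in bytes; tr strips padding some systems add.
    current=$(wc -c < "$file" | tr -d ' ')
    [ "$current" = "$previous" ] && return 0
    previous=$current
    sleep "$interval"
  done
}
```

In the polling script, a call such as `wait_until_stable "$i" 5` could take the place of the fixed 20 second sleep. It always waits at least one interval, because it needs two measurements to compare.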

The file config.yaml contains a dictionary of disk folders and corresponding keywords. If the text in a scanned document matches a keyword, the document automatically gets sorted into the corresponding folder on the disk.
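To give an idea of what such a file can look like, here is a sketch that writes an example config.yaml from the shell. The key names (target_folder, default_folder, folders) are taken from my reading of the pypdfocr README and should be checked against the version you have installed; all paths, folder names and keywords are invented for illustration.

```shell
#!/bin/sh
# Write an example config.yaml for pypdfocr's filing feature.
# All paths, folder names and keywords below are made-up examples.
cat > config.yaml <<'EOF'
# Base directory that receives the sorted, searchable PDFs
target_folder: "/nfs/documents/filed"
# Where PDFs without any keyword match end up
default_folder: "/nfs/documents/filed/unsorted"

# folder name -> list of keywords that send a document there
folders:
  insurance:
    - insurance
    - policy number
  tax:
    - tax office
    - tax return
  invoices:
    - invoice
EOF
```

With a file like this, a scanned letter containing the word “invoice” would be expected to end up in the invoices subfolder of the target folder.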