Scanning Documents: Lessons Learned

As a follow-up, I actually installed Ocropus and tried it out. It turns out this really is beta software. There were a few too many loose nuts and bolts, so I ended up going for a two-pronged approach: VueScan plus Evernote.

Ocropus itself takes bitmaps as input and does not do the actual scanning. For that, you need another tool. On Linux, XSane is pretty much the standard tool, and it is the one I used. It also turns out to be a good choice if you want to scan multiple pages: it will automatically scan a preset number of pages. You only need to make sure that you can place the next page on your scanner before each scan begins (not always easy).

photo credit: Seán Venn

Based on a set of images, Ocropus starts its analysis. First it splits the documents into blocks and lines. Next comes the Optical Character Recognition (OCR), which is where I got stuck. The OCR in Ocropus is based on an AI solution that needs to be trained. Sadly, Ocropus came with incomplete or outdated recognition models. You can train Ocropus and create your own model, but for that you need a set of reference documents. You can imagine the time required to create the reference docs (there is a basic set available) and to train the algorithm.

It was too much for me, so I started looking for other solutions. I did stumble upon Cuneiform, but it didn’t seem to like my documents very much. The results were abysmal, as were those of the other programs I tried.

Eventually, I had a revelation. Evernote is not only my favorite note-keeping program, it also has OCR built in. And, from the few tests I did, a pretty good one at that. So why not create a PDF, dump it in Evernote and let it do the OCR?

There are several ways to scan and create a PDF. You might want to check out the software that came with your scanner. iCopy in combination with PDFCreator is a good open source solution. And last but not least, if you want it really easy and don’t mind spending a fairly small amount of money, get yourself VueScan.

photo credit: mskogly

Once you have a PDF, just move it into an Evernote note and, if you are a premium subscriber, the OCR will kick in. If you aren’t a premium subscriber, there’s no OCR, but with a little tagging you can at least retrieve the document very quickly. That is something you couldn’t do with the paper version, so it’s already a big improvement.
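To make the tagging idea a bit more concrete, here is a minimal sketch of why a few tags per document are enough to find things again. This is not how Evernote works internally and it does not use any Evernote API; the `TagIndex` class, the file names and the tags are all made up for illustration.

```python
from collections import defaultdict


class TagIndex:
    """Toy in-memory index mapping tags to document file names (hypothetical)."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def add(self, filename, tags):
        # Store tags in lower case so lookups are case-insensitive.
        for tag in tags:
            self._by_tag[tag.lower()].add(filename)

    def find(self, *tags):
        """Return the sorted file names that carry every given tag."""
        if not tags:
            return []
        sets = [self._by_tag.get(tag.lower(), set()) for tag in tags]
        return sorted(set.intersection(*sets))


# Made-up example documents and tags.
index = TagIndex()
index.add("2010-03-electricity.pdf", ["invoice", "utilities", "2010"])
index.add("2010-04-car-insurance.pdf", ["insurance", "car", "2010"])
index.add("2009-12-electricity.pdf", ["invoice", "utilities", "2009"])

print(index.find("invoice", "2010"))
```

Combining two or three tags narrows the result down to a single document very quickly, which is the whole point: even without OCR, the scan is easier to find than the sheet of paper in a binder.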

Lessons learned:

Good document scanning, OCR and management aren’t easy. If you’re serious about it, some investment in good software and hardware will pay off big time.
If you’re going all the way with converting paper to digital, you need a dedicated scanner with an automatic document feeder. There really is no way around it. A flatbed scanner will only suffice if you are already digital and have only a few documents arriving on paper, which is very unlikely.
Most decent scanners include some good software, which eliminates many of the issues discussed in this post. It does not solve the archiving, though.
