Scanning Documents: Lessons Learned

In a bit of continuity, I actually installed Ocropus and tried it out. Turns out, this really is beta software. There were a few too many loose nuts and bolts, and I ended up going for a two-pronged VueScan and Evernote approach.

Ocropus itself takes bitmaps as input and does not do the actual scanning. For that, you need another tool. On Linux, XSane is pretty much the standard tool, and it is the one I used. It also turns out to be a good choice if you want to scan multiple pages: it will automatically scan a preset number of pages. You only need to make sure that you can place the next page on your scanner before each scan begins (not always easy).

Based on a set of images, Ocropus starts its analysis. First it splits the documents into blocks and lines. Next is the Optical Character Recognition (OCR), which is where I got stuck. The OCR in Ocropus is based on an AI solution that needs to be trained. Sadly, Ocropus came with incomplete or outdated recognition models. You can train Ocropus and create your own model, but for that you need a set of reference documents. You can imagine the time required to create the reference docs (there is a basic one available) and to train the algorithm.
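To give an idea of what "splitting a page into lines" means, here is a minimal sketch in Python. It is not how Ocropus does it (its layout analysis is far more involved). It assumes the page is already a clean black-and-white bitmap and simply looks for runs of pixel rows that contain ink. The name `split_into_lines` and the tiny sample page are made up for illustration.

```python
from typing import List, Tuple

# A page as rows of pixels: 1 = ink, 0 = background (an assumption of this sketch).
Bitmap = List[List[int]]


def split_into_lines(bitmap: Bitmap) -> List[Tuple[int, int]]:
    """Return (first_row, last_row) for every run of consecutive rows with ink."""
    lines: List[Tuple[int, int]] = []
    start = None  # first row of the text line currently being tracked
    for y, row in enumerate(bitmap):
        has_ink = any(row)
        if has_ink and start is None:
            start = y  # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))  # an empty row closes the line
            start = None
    if start is not None:
        # The last text line runs to the bottom edge of the page.
        lines.append((start, len(bitmap) - 1))
    return lines


# Hypothetical 5-row page: one text line on rows 1-2, another on row 4.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(split_into_lines(page))  # [(1, 2), (4, 4)]
```

Real scans are skewed and noisy, so a row is rarely completely empty. That is one reason the real thing is a lot harder than this.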

It was too much for me, so I started looking for other solutions. I did stumble upon Cuneiform, but it didn’t seem to like my documents very much. The results were abysmal, as were those of the other programs I tried.

Eventually, I had a revelation. Evernote is not only my favorite note-keeping program, it also has OCR built in. And, from the few tests I did, a pretty good one at that. So why not create a PDF, dump it in Evernote and let it do the OCR?

There are several ways to scan and create a PDF. You might want to check out the software that came with your scanner. iCopy in combination with PDFCreator is a good open source solution. And last but not least, if you want it really easy and don’t mind spending a fairly small amount of money, get yourself VueScan.

Once you have a PDF, just move it into an Evernote note and, if you are a premium subscriber, the OCR will kick in. If you aren’t a premium subscriber, there’s no OCR, but with a little tagging you can at least retrieve the document very quickly, something you couldn’t do with the paper version. That’s already a big improvement.
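Why does a bit of tagging help so much, even without OCR? The sketch below illustrates the idea in Python. It has nothing to do with Evernote's actual implementation or API. It is just a small in-memory index that maps tags to file names, and the class name `TagIndex` and the file names are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class TagIndex:
    """Maps lower-cased tags to the set of documents that carry them."""

    def __init__(self) -> None:
        self._by_tag: Dict[str, Set[str]] = defaultdict(set)

    def add(self, filename: str, tags: Iterable[str]) -> None:
        for tag in tags:
            self._by_tag[tag.lower()].add(filename)

    def find(self, *tags: str) -> Set[str]:
        """Return the documents that carry every one of the given tags."""
        if not tags:
            return set()
        matches = [self._by_tag.get(tag.lower(), set()) for tag in tags]
        return set.intersection(*matches)


# Hypothetical scanned documents and tags.
index = TagIndex()
index.add("electricity-2009-11.pdf", ["invoice", "utilities", "2009"])
index.add("car-insurance.pdf", ["insurance", "car", "2009"])
index.add("phone-2009-10.pdf", ["invoice", "phone", "2009"])

print(sorted(index.find("invoice", "2009")))
# ['electricity-2009-11.pdf', 'phone-2009-10.pdf']
```

Two or three tags per document are usually enough to narrow a search down to a handful of files, which is the whole point.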

Lessons learned:

  • Good document scanning, OCR and management isn’t easy. If you’re serious about it, some investment in good software and hardware will pay off big time.
  • If you’re going all the way with converting paper to digital, you need a dedicated scanner with an automatic feeder. There really is no way around it. A flatbed scanner will only suffice if you are already digital and have only a few documents arriving on paper, which is very unlikely.
  • Most decent scanners will include some good software, which sort-of eliminates many of the issues discussed in this post. It does not solve the archiving though.

2 Comments

  1. Mike
    Posted November 28, 2009 at 10:35 am | Permalink

    Hi Peter,

    Just a short comment on the OCR part. ABBYY also offers its OCR technology for Linux. It used to be an SDK only, but now they have a command-line OCR version as well. It can generate searchable PDFs right away:

    TextOnly
    The recognised text will be saved as text, and the pictures will be saved as pictures. The original document layout will not be retained.

    TextOnImage
    The entire page image will be saved as a picture. Text areas will be saved as text over the picture.

    ImageOnText
    The entire page image will be saved as a picture. The recognised text is 'written' under the picture. This option is useful if you export text to document archives: the full page layout will be retained and the full-text search will be available.

    ImageOnly
    The entire page image will be saved as a picture.

    http://www.ocr4linux.com/Documentation/ABBYYOCR…

    There is a free trial available ;o)

    BR
    Mike
