This post describes how to scan pages from a printed book and convert the image to text using Optical Character Recognition (OCR) technology.
The tools that I use are:
SimpleScan is a GUI scanning application that comes pre-installed on many Linux distributions (including Debian Wheezy).
To manually install it on Debian:
$ sudo apt-get install simple-scan
tesseract is a command-line OCR program.
$ sudo apt-get install tesseract-ocr
If you only need English, that is all you have to install. For any other language, you must install the matching tesseract language pack, for example tesseract-ocr-rus for Russian, tesseract-ocr-deu for German, or tesseract-ocr-fra for French.
- Scan the pages using SimpleScan.
- Save the image.
- Run the tesseract command:
$ tesseract OnWritingWell.jpg out
Tesseract Open Source OCR Engine v3.02 with Leptonica
The first parameter is the input image filename. The second parameter is the desired basename of the output text file. The default txt extension is added to the basename, e.g., out.txt.
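A book usually means more than one scanned page, so the single command above can be wrapped in a loop. The following is a minimal sketch, not part of my original workflow: the function name ocr_pages and the page1.jpg style filenames are made up for illustration. It relies only on the two parameters described above (input image, output basename).

```shell
# ocr_pages IMAGE...
# Hypothetical helper: runs tesseract once per image.
# page1.jpg produces page1.txt, because tesseract appends .txt to the basename.
ocr_pages() {
    for img in "$@"; do
        base="${img%.*}"                    # strip the extension: page1.jpg -> page1
        tesseract "$img" "$base" || return 1    # stop at the first failing page
    done
}

# Example usage (filenames are assumptions):
#   ocr_pages page1.jpg page2.jpg page3.jpg
#   cat page1.txt page2.txt page3.txt > book.txt
```

Keeping one text file per page makes it easier to compare each result against its scan before joining everything into a single file.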
If the language is not English, you need to specify it on the command line with the -l option and a 3-character language code (refer to the tesseract man page). Multiple languages are joined with a plus sign. The following command specifies 3 languages: Russian, German and French.
$ tesseract OnWritingWell.jpg myout -l rus+deu+fra
In the OnWritingWell.jpg example above, the scanned page contained 734 words. In the output text file, 119 of them (16% of the total) needed some form of manual correction, which translates to roughly 84% OCR accuracy. A single page is far too small a sample to be statistically valid. What performance are you getting from OCR?
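If you want to report your own numbers, the arithmetic can be sketched as a small shell function. This is only an illustration: the name ocr_accuracy is made up, and the count of words needing correction still has to come from proofreading by hand. The total word count can be obtained with wc -w on the output text file.

```shell
# ocr_accuracy TOTAL_WORDS WORDS_NEEDING_CORRECTION
# Hypothetical helper: prints the percentage of words that needed
# no correction, rounded to one decimal place.
ocr_accuracy() {
    awk -v total="$1" -v bad="$2" \
        'BEGIN { printf "%.1f\n", 100 * (total - bad) / total }'
}

# Figures from this post: 734 words in total, 119 needing correction.
ocr_accuracy 734 119    # prints 83.8
```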