Tesseract OCR for Porteus 2.1

14 Jul 2013

Tesseract is probably the most accurate open source OCR engine available. Combined with the Leptonica Image Processing Library it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google. It is released under the Apache License 2.0.
OCR stands for Optical Character Recognition: it converts images of text into digital, editable text.
You can download it from here. There are 10 files to download; you can put them in a folder named OCR and place that folder inside your modules folder.

I packaged modules for the languages most used by members of the Porteus community. I tested a couple of images and the results were not as accurate as claimed, but still better than having to retype everything.

This is a CLI utility. Usage, taking Spanish as an example (the default language is English):
#tesseract textimage.png textvector -l spa

The result is written to textvector.txt in UTF-8 encoding. Various image formats are supported.
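
If you have a whole folder of scans, the same command can be wrapped in a small loop. This is only a rough sketch: the function names ocr_output_base and ocr_dir are made up for illustration, and it assumes your images are PNG files sitting in one directory. It uses nothing beyond the tesseract invocation shown above.

```shell
#!/bin/sh
# Batch OCR sketch. The function names and the "all PNGs in one
# directory" layout are assumptions for illustration only.

# Turn an image path into an output base name:
# scans/page1.png -> page1
ocr_output_base() {
    base=$(basename "$1")
    printf '%s\n' "${base%.*}"
}

# Run tesseract on every PNG in a directory.
# $1 = directory, $2 = language code (falls back to eng)
ocr_dir() {
    dir=$1
    lang=${2:-eng}
    for img in "$dir"/*.png; do
        # Skip the unexpanded pattern when no PNG files exist
        [ -e "$img" ] || continue
        out=$(ocr_output_base "$img")
        # tesseract appends .txt to the output base itself
        tesseract "$img" "$dir/$out" -l "$lang"
    done
}
```

After sourcing this, something like `ocr_dir ~/scans spa` should leave a .txt file next to each image, for example page1.txt beside page1.png.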

Other languages I included are deu (German), ita (Italian), rus (Russian), por (Portuguese), fra (French) and equ (mathematical symbols). Caution: I only tested Spanish!
If you need other languages, you can get them from the download list at https://code.google.com/p/tesseract-ocr/downloads/list

I hope this will be useful for you!
