gImageReader: Open source, Google-powered OCR (optical character recognition) program that actually works

Optical character recognition is one of a few types of technology meant to make our lives easier. The only problem is, to take advantage of this convenience, one typically has to shell out a lot of cash: Good OCR programs are bloody expensive. Luckily, Google is in the business of making things better and free; and OCR is no exception.

Tesseract OCR and gImageReader

Tesseract OCR is an optical character recognition engine that was originally developed and maintained by HP from 1985 to 1995. At its peak, Tesseract was considered one of the best OCR engines out there. After 1995 HP stopped putting much effort into Tesseract, and in 2005 HP released Tesseract’s source code. Since then, Google has been updating and maintaining Tesseract, and today Tesseract is once again considered to be one of the most powerful OCR engines available. The only problem with Tesseract is that for the common user it is a pain in the [bleep] to use. This is where gImageReader comes in.

gImageReader is a program that serves as a GUI to Tesseract; it uses Tesseract to process OCR but adds an interface that the common man can use.

Getting Set Up

First and foremost, you need to download gImageReader and Tesseract OCR; they are two separate downloads. (Download links are available at the end of this article.) Once you download them both, they both need to be installed. (Duh.)

Once you have both gImageReader and Tesseract installed, open gImageReader. A Configuration dialog will be the first thing you see. At the Configuration window, you need to do two things:

  • Type in the path to your Tesseract installation:

Unless you specifically changed it, the path for Tesseract is C:\Program Files\Tesseract-OCR for 32-bit machines and C:\Program Files (x86)\Tesseract-OCR for 64-bit machines.

Take note that the Tesseract path may be automatically filled in for you. If this is the case, just confirm that it is right – no need to change anything (unless it’s wrong, in which case you do need to change it).

  • Enter the path to Tesseract dictionaries:

Unless you specifically changed it, these are found in C:\Program Files\Tesseract-OCR\tessdata (32-bit) and C:\Program Files (x86)\Tesseract-OCR\tessdata (64-bit).
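If you are not sure which of the two default locations applies to your machine, a small script can check for you. The following is only a minimal Python sketch: the folder names are the defaults quoted above, and `find_tesseract` is a made-up helper name, not something that ships with Tesseract or gImageReader.

```python
from pathlib import Path

# Default install folders quoted in the article; edit these if you installed elsewhere.
DEFAULT_INSTALL_DIRS = [
    r"C:\Program Files\Tesseract-OCR",
    r"C:\Program Files (x86)\Tesseract-OCR",
]


def find_tesseract(candidates=DEFAULT_INSTALL_DIRS):
    """Return (install_dir, tessdata_dir) for the first candidate folder that
    contains a tessdata subfolder, or None if no candidate qualifies."""
    for candidate in candidates:
        install_dir = Path(candidate)
        tessdata_dir = install_dir / "tessdata"
        if tessdata_dir.is_dir():
            return install_dir, tessdata_dir
    return None


if __name__ == "__main__":
    found = find_tesseract()
    if found is None:
        print("Tesseract not found in the default locations.")
    else:
        print("Tesseract path:   ", found[0])
        print("Dictionaries path:", found[1])
```

The two printed paths are what you would type into the Configuration dialog. If you installed Tesseract somewhere else, pass your own list of folders to `find_tesseract`.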

After doing all of the above, you are ready to start OCR’ing.

Optically recognizing characters

As already mentioned, Tesseract is the engine while gImageReader is the GUI; so you don’t have to do anything with Tesseract itself; you use Tesseract through gImageReader.

After you get past the configuration mumbo jumbo mentioned above, you’ll be met with the following:

To start OCR’ing, either hit Open Images and import the images/PDFs you want to OCR…

…or hit Acquire Image to scan in a document:

Once you have images/PDFs loaded into gImageReader, what to do next depends on what type of images/PDFs they are:

  • Mostly Text

On images/PDFs that are made up of mostly text you can do a full recognition. In other words, you don’t need to select any specific text – just hit Recognize all and gImageReader will OCR the whole image/PDF. Take note, however, if you are OCRing a PDF that has multiple pages, Recognize all will only OCR one page at a time. You need to manually do Recognize all on other pages.

  • Text and Images

One of the downsides of Tesseract is that it doesn’t recognize pictures embedded with text. So if you have a file that has both text and images, for the best result you need to selectively highlight the text portions and hit the Recognize selection button. Recognize selection only OCRs the part that you have selected.

(Although above I mention using Recognize selection for images/PDFs that contain text and images, if you want to only OCR a specific portion of an all-text document, you can use Recognize selection for that, too.)

Regardless of whether you do a full recognition or a selective recognition, once an image/PDF has been OCR’ed, the contents are displayed in a pane to the right:

Once the OCR’ed results are displayed, you can manually edit any mistakes, use a “search and replace” feature, save the output as a text file, or clear the output.

If you loaded multiple images/PDFs into gImageReader all of the files will be listed in a pane on the left…

…and you must conduct OCRs on them each separately. OCR results of multiple documents – or multiple pages of the same document – are displayed one after another in the results pane (the one that appears at the right of the program window after you OCR something) discussed previously.

Conversion Quality

As already mentioned, Tesseract is considered to be one of the best OCR engines out there. The following were the results of two tests I did:

As you can see, for the first test only three minor mistakes were made: “XVW Wrst” instead of “Vista”, “Wrsion” instead of “Version”, and “size:42. OKB zmpped” instead of “size: 42KB zipped”. For the second test the OCR again is very good but “visual design” is converted as “visual d’ esign” (with “esign” on a whole separate line when it shouldn’t be) and “left” is converted as “leli”.
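If you want a rough number instead of eyeballing the mistakes, you can compare the OCR output against the text you expected. The sketch below uses Python's standard `difflib` module. Its similarity ratio is only a quick stand-in and not a formal OCR accuracy metric, and `similarity` is a name I made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(expected, recognized):
    """Return a ratio between 0.0 and 1.0 describing how closely the OCR
    output matches the reference text (1.0 means identical)."""
    return SequenceMatcher(None, expected, recognized).ratio()


# Word pairs taken from the mistakes listed above.
pairs = [("Version", "Wrsion"), ("left", "leli")]
for expected, recognized in pairs:
    print(f"{expected!r} vs {recognized!r}: {similarity(expected, recognized):.2f}")
```

Running the same comparison on a whole page of text, rather than single words, gives a better feel for how well a particular scan came out.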

Of course no OCR program is perfect, and Tesseract is no exception: OCR quality will depend on the image/PDF being OCR’ed. The clearer the image/PDF, the better the results will be. However, generally speaking, Tesseract performs brilliantly.

Lastly, take note that Tesseract/gImageReader only support plain text; the OCR’ed text will not be formatted.

Update by Ashraf: For some reason I cannot get Tesseract/gImageReader to work on images. It works beautifully with PDFs but for images it produces garbage output. I am not sure what is up because Locutus doesn’t seem to be having this problem. If you have the same problem as me and figure out a solution, please share in the comments below.

Update2 by Ashraf: Apparently Tesseract cannot work with low DPI images, which is why I am having the problem stated in the previous update. Please use higher DPI images (300 DPI seems to be the magic number) if you want good results. Or, alternatively, create a PDF out of your low DPI images and use that; Tesseract works great with PDFs.
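To get a feel for what moving from a low DPI to 300 DPI means in pixels, here is a minimal Python sketch of the arithmetic. It only calculates the target size; the actual resampling still has to be done in an image editor or with an imaging library. `upscale_dimensions` is a hypothetical helper name, and the 720x360 screenshot is a made-up example.

```python
def upscale_dimensions(width_px, height_px, source_dpi, target_dpi=300):
    """Return the pixel size an image must be resampled to so that it covers
    the same physical area at target_dpi as it did at source_dpi."""
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("DPI values must be positive")
    new_width = round(width_px * target_dpi / source_dpi)
    new_height = round(height_px * target_dpi / source_dpi)
    return new_width, new_height


# A hypothetical 720x360 screenshot saved at 72 DPI.
print(upscale_dimensions(720, 360, 72))  # prints (3000, 1500)
```

In other words, a 72 DPI image has to grow by a factor of a little over four in each direction, which is why small screenshots tend to OCR so poorly as they are.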

Multi-Language Support

Tesseract is designed to work with Unicode, so it works with many different languages. If you want to OCR a language other than English, be sure to install that particular language file during installation of Tesseract:

Currently there are language packs for English, Bulgarian, Catalan, Czech, Chinese (traditional and simplified), Danish, Dutch, German, Greek, Finnish, French, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Serbian, Swedish, Tagalog, Turkish, Ukrainian, and Vietnamese.

Take note that I only tested the program in English; so I don’t know if the superior English quality is the same for other languages.
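If you are not sure which language packs actually got installed, you can look inside the tessdata folder. The Python sketch below assumes that each language is stored as a file named after its code with a `.traineddata` extension (for example `eng.traineddata`), which is how I understand Tesseract 3.x to lay things out. Treat that naming, and the `installed_languages` helper name, as assumptions.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file
    in the given tessdata folder."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


if __name__ == "__main__":
    # Default 32-bit tessdata location quoted earlier in the article.
    print(installed_languages(r"C:\Program Files\Tesseract-OCR\tessdata"))
```

An empty list means either that the folder path is wrong or that no language files were installed there.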

Annoying aspects of gImageReader

There are two things I found to be annoying about gImageReader:

  • As already mentioned, with gImageReader you must OCR multiple pages of the same document and different images/PDFs separately. I find this to be very annoying. I wish there was the ability to OCR multiple pages and multiple images/PDFs with the click of one button.
  • The other thing I found annoying was the inability to zoom in with the mouse scroll wheel; to zoom in one must click on the zoom buttons. I wish there was a way to use the mouse wheel instead.
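One possible workaround for the first annoyance is to skip the GUI and call the Tesseract command line directly, once per image. The Python sketch below assumes that `tesseract` is on your PATH and that it accepts the form `tesseract imagename outputbase -l lang`. It also assumes your scans are TIFF files, since the image formats Tesseract can read depend on how it was built. `build_commands` and `run_batch` are names I made up for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(image_dir, output_dir, lang="eng", pattern="*.tif"):
    """Return one tesseract command (as an argument list) per matching image."""
    commands = []
    for image in sorted(Path(image_dir).glob(pattern)):
        # Tesseract adds the .txt extension to the output base name itself.
        output_base = Path(output_dir) / image.stem
        commands.append(["tesseract", str(image), str(output_base), "-l", lang])
    return commands


def run_batch(image_dir, output_dir, lang="eng"):
    """Run tesseract on every matching image, one after another."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for command in build_commands(image_dir, output_dir, lang):
        subprocess.run(command, check=True)
```

Calling `run_batch("scans", "ocr_output")` would then OCR every TIFF in a folder named scans and write one text file per image into ocr_output; both folder names are hypothetical. You lose gImageReader's selection and editing features this way, so it only makes sense for mostly-text pages.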

Conclusion

OCR is not an easy thing to do. If you can afford it and you want to, you could shell out hundreds of dollars for brilliant OCR software (there are many shareware solutions that perform extremely well). Or, if you can’t afford it or if you want to save the money, you can download and use gImageReader/Tesseract. Not only are they free, but they perform extremely well.

You can grab gImageReader and Tesseract from the following links:

Version reviewed: gImageReader v0.8.1 and Tesseract OCR v3.00

Supported OS: Pretty much all Windows for gImageReader and Tesseract OCR; Tesseract OCR also works on Linux and Mac OS X

Download size: gImageReader 16.8 MB and Tesseract OCR 1.8 MB

Malware scan: gImageReader VirusTotal scan (2/43) and Tesseract OCR VirusTotal scan (0/43)

gImageReader homepage [download link]

Tesseract OCR homepage [direct download - Windows version]

30 comments

  1. Harleen Kaur

    I am using gImageReader to read text from a PDF image. All the text gets converted easily, but the problem is with the numbers and dates in the PDF document: gImageReader doesn’t read numbers clearly. Can anyone please help me with this?

  2. Robert

    Hi Locutus, I’ve lost 2 hours … I am not able to run gImageReader with the SLOVAK lang pack….
    I have sk_SK.aff and sk_SK.dic files in the MySpell/dicts folder. The en_GB and en_US files are there too. Those en_XX files are recognized by the program without a problem. But the program doesn’t recognize any of the cs_xx or sk_xx files and doesn’t show them for selection.
    Can you please try it and give me steps that should work properly?
    Thanx
    R.

  3. Col. Panek

    @Ashraf: I scanned in 200 pages in 72 DPI and found out my OCR wouldn’t read it. I simply resized each page in GIMP to be 300 DPI and then the OCR worked fine! I suppose there’s a batch resizer app somewhere but I was opening up each page anyway to clean them up a little.

    After scanning those dozens of pages (slowly), I’m interested in a DIY scanner: http://www.diybookscanner.org/

  4. Col. Panek

    @Anonur22coolz: If you have a PDF document that consists of text in PDF format, you only need a PDF to text converter, not an OCR. You can even use Open- or LibreOffice to open the PDF (after you install the plugin) and edit the PDF, then save as whatever (or export to PDF again). If you have an image with text you need read and converted to editable text, that’s when you need OCR.

  5. Anonur22coolz

    Hmm…
    Purpose of OCR I think should be to make PDF documents searchable. But this only extracts plain text. :(
    So any other software/workaround to OCR PDFs and save it as PDF (and not .txt)?
    Would be useful to make big ebooks searchable and less space occupying.

  6. Pkghosh

    Ashraf,

    There’s a newer version of gImageReader and bugfix for Tesseract OCR v3.00. Maybe some of the problems you mentioned are answered. Your review update on these will be valuable.

    Regards,

  7. sandromani

    @lol768:
    The program works natively on linux, you can download deb and rpm packages on sourceforge.

    @ Author: scrolling _does_ work with ctrl+scroll . If you can confirm that this does not work with your setup, could you open a bug on the sourceforge page where I can look at the issue more closely? thanks.

  8. Peter Stern

    @Ashraf: No, absolutely nothing happens except a momentary terminal flash. Even trying to start from an administrative command prompt, entering the command for gImageReader just immediately returns to a prompt. I am thinking of writing a bat file to see if I can get some output on what is wrong.

  9. Ashraf

    @Chuck Wilsker: No, that just means 2 scanners out of 43 thought it was malicious. In other words, it is probably clean.

    @Locutus: Should have mentioned that in the article…

    Anyway that is kind of stupid. 72 DPI is the standard for computer images and Tesseract can’t work with that?!

  10. Ashraf

    @Jonathan Weber: Like Peter I always thought the point of OCR is plain text, so I didn’t include that part in the article above. Updated now. Thanks.

    @Peter Stern: Are you receiving an error when running them?

    @Encarnito: I wouldn’t say completely useless but it does make it a lot less useful.

    @Jerry: What OS are you on?

    @Johan Aa: I am having the same issue! I don’t know what is up… Locutus doesn’t seem to be having the same issue, so we must be doing something wrong.

    @Natalie: Thanks!

    @Locutus: 72 DPI for most of them. However, what does DPI have to do with it? I am not having problems with images that I scan in – it is for other images, like screenshots that I take.

  11. Encarnito

    Hi Ashraf! Thank you very much for this very interesting article! Unfortunately, the fact that there is no option to OCR multiple pages with one click, makes gImageReader completely useless.

  12. Peter Stern

    I installed both applications but neither one will run. I tried double-clicking, right-clicking and starting as Administrator (gImageReader) and starting gImageReader from the run window. I have Vista sp2 32. I did not install in the default directory but in an external drive.

  13. Ashraf

    This works perfectly for PDFs. However, for some reason, I cannot get it to work with images. The result comes out to be very poor even for images that should be very easy to OCR. What am I doing wrong?!