[Windows] Convert scanned PDF books and documents into electronic text files with PDF OCR

January 2, 2013

As you all know, converting scanned documents into electronic files is not the easiest process in the world. A few programs claim to do this, but most of them fall well short of that goal. PDF OCR promises to be different from the others. So let’s find out how it does!

WHAT IS IT AND WHAT DOES IT DO

Main Functionality

PDF OCR is based on OCR (Optical Character Recognition) technology. The idea is for this program to convert scanned PDF files (paper books, documents, etc.) into editable electronic text files. PDF OCR comes with a built-in text editor, which allows you to edit the OCR results without using MS Word. The program also supports a batch mode that converts all pages of a PDF file to text at the same time. It comes with a Scanned Image To PDF Converter as well, which means you can create your own scanned PDF books.

Perfect program for editing PDF files that were created using a Scan-to-PDF function that many scanners offer.

Pros

  • Intuitive interface that is simple for almost everyone to figure out
  • Can be used as a standard PDF viewer
  • Allows you to extract text from PDF files — quickly converts text in a PDF document into an editable text document

Cons

  • Page breaks do not seem to be recognized during conversion
  • Hit-or-miss conversion quality. Had difficulty extracting text from images like the program promises. Only clear and easy-to-read text was converted properly into an editable document, and even that text needs to be proofread, because the program still makes multiple mistakes
  • Extremely resource hungry (at times it was using 80% of my computer’s resources during conversion)

Discussion

There are a few different programs out there that claim to use OCR technology. However, if you have ever tried them, you will know that most don’t work very well. The whole idea behind this technology is for the program to read non-editable text (whether from a scanned document or from a PDF file).

The idea works a lot better in theory than it does in practice. That being said, I did find that PDF OCR worked better than some other free OCR programs. However, in the end, I am still not sure that the program is worth the price tag.

Let’s start off with what it does right. It has an easy-to-understand interface, so everyone can use it. Also, you can use the program as a standard PDF viewer; however, most of us already have programs for that anyway. The program will let you create editable text documents from scanned documents and PDF files. If you are converting easy-to-read text, the program works most of the time. It is when you start working with images that things get a bit…odd.

Now let’s get to the problems I had with the program. As I just mentioned, the program works all right when dealing with just text. However, you are still going to need to proofread the document that it creates for you. Nine times out of ten, you are still going to find a few mistakes. If you want to pull text from images…you might as well forget about it. Sure, the program says it can, but I was not able to pull text from any image and have it come out readable. Maybe I was using the wrong kind of pictures, but if I purchased this program, I wouldn’t want to be limited in which pictures it can actually pull text from.

I wish the bad news ended there, but it doesn’t. When this program is converting anything, it becomes extremely resource hungry. As in, it was using over 80% of my computer’s resources during a standard conversion. Not only that, but the program does not seem to read or understand page breaks at all. This causes a lot of formatting errors even when no images are in the mix.

CONCLUSION AND DOWNLOAD LINK

If this program were free, I might say it is worth the download. However, with a price tag of $39.95 (and that is on sale, mind you; the regular price is higher), I just can’t recommend this program to anyone. This is not the worst OCR program I have ever used, but it is far from the best, and I wouldn’t recommend that anyone drop $39.95 in return for the mediocre output quality this program offers.

Anyone looking for a free OCR solution will be hard pressed to find one, simply because good OCR is difficult to do and good OCR programs typically cost a lot of money. The best free OCR program I know of is gImageReader, which uses an open source OCR engine, but even gImageReader has its quirks. If anyone knows of good OCR programs (free or paid), do let us know in the comments below.

Price: Free to try, $39.95 to buy

Version reviewed: 4.1

Supported OS: Windows 2000 / XP / 2003 / Vista / 7

Download size: 13.7MB

VirusTotal malware scan results: 1/46

Is it portable? No

PDF OCR homepage

15 Comments »

  1. Robert January 2, 2013 at 12:42 AM (comment permalink) -

    Right now on http://www.giveawayoftheday.com you can get PDF OCR 4.3 FREE.

  2. Corno January 2, 2013 at 1:45 AM (comment permalink) -

    It is only recently that PDF converters with OCR facilities have been given away. Recently Aiseesoft PDF2Word was given away at GOTD, and truth be told it produced conversions that were outstanding (better in fact than the OCR conversions done by the paid Wondershare Pro PDF Converter). What bothers me is that there is a shady side to Aiseesoft and some other obscure vendors that are flooding the market with PDF converters these days. Remember, PDF is a weak spot in your PC’s line of defense and the prospect of becoming part of a totalitarian state’s botnet does not exactly appeal to me.

  3. Corno January 2, 2013 at 1:58 AM (comment permalink) -

    Ashraf, I think you should have compared conversions made from scanned PDFs with conversions made from PDFs that were initially produced as text files.

  4. Shawn January 2, 2013 at 4:39 AM (comment permalink) -

    One of my favorites available at SourceForge

    http://capture2text.sourceforge.net/

    And yes it’s portable.

  5. HJB January 2, 2013 at 6:43 AM (comment permalink) -

    I find Adobe Acrobat OCR works quite well. Even the best OCR engines are dependent upon the graphic quality of the material they are given.

    Generalizations based upon comparing apples and oranges have no validity and are misleading.

    A real test of OCR would follow these steps:

    1. Create a pdf from a text document, such as Word or plain text.

    2. Save the pdf pages as a graphic – using Acrobat or other reliable conversion application.

    3. Feed the graphic to the OCR engine.

    4. Compare 1 to the result of 3, either by visual examination or by feeding it to MS Word’s version comparison.
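
The comparison in step 4 can also be scripted instead of done by eye. What follows is only a minimal sketch, not something the commenter or the reviewed program provides: it assumes the original text and the OCR result are already available as plain strings, and the function names `similarity` and `word_differences` are made up for illustration. It uses Python's standard `difflib` module.

```python
import difflib


def similarity(reference: str, ocr_output: str) -> float:
    """Return a ratio between 0 and 1 describing how closely the OCR text matches."""
    # Collapse all whitespace so that different line wrapping is not counted as an error.
    ref = " ".join(reference.split())
    out = " ".join(ocr_output.split())
    return difflib.SequenceMatcher(None, ref, out, autojunk=False).ratio()


def word_differences(reference: str, ocr_output: str) -> list[tuple[str, str]]:
    """List (expected, got) pairs for every run of words the OCR engine got wrong."""
    ref_words = reference.split()
    out_words = ocr_output.split()
    matcher = difflib.SequenceMatcher(None, ref_words, out_words, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty string on either side means words were dropped or inserted.
            diffs.append((" ".join(ref_words[i1:i2]), " ".join(out_words[j1:j2])))
    return diffs


if __name__ == "__main__":
    original = "The quick brown fox"   # text from step 1 (hypothetical sample)
    recognized = "The qu1ck brown fox"  # text returned by the OCR engine (hypothetical sample)
    print(round(similarity(original, recognized), 3))
    print(word_differences(original, recognized))
```

Running the same original text against the output of several OCR engines gives one score per engine, which makes the results directly comparable.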

  6. Ashraf January 2, 2013 at 6:52 AM (comment permalink) -
    Mr. Boss

    @Corno: What would be the point in that? There is no need for OCR with text PDFs. This program is not for text PDFs.

  7. RobCr January 2, 2013 at 7:07 AM (comment permalink) -

    @HJB:
    Many DotTechies could have this question on their lips:
    Is it free?

  8. HJB January 2, 2013 at 7:21 AM (comment permalink) -

    @RobCr:

    When it comes to software, most people have four questions:

    1. How well does it work?
    2. What does it cost?
    3. How does it compare to other similar products with regard to 1 and 2?
    4. Is it worthwhile to me for things I want to do?

    A program that does not do the job that meets your needs is just a time waster for you even if it is free.

    As to OCR and pdfs, the subject is more complicated than most blog discussions indicate. For example, Adobe has several kinds of OCR that the user can opt to use. Each has its place and provides benefits that are appropriate to different needs. The major weaknesses are either that errors are hidden behind a searchable graphic image or that correcting an error within a pdf is anywhere from difficult to impossible.

    Someone ought to produce a pdf text editor that works something like MS Word’s correction processes.

  9. Janet January 2, 2013 at 8:09 AM (comment permalink) -

    @Ashraf:

    I think HJB’s point was to do the tests on a clear, non-problematic doc. An image file made from a text file is a good basis for comparison because OCR apps are designed to work on (at least) clear text in standard fonts, and different apps will have different problems with images made from e.g., hand printing, or graphics, or a scan of poor quality text such as a newspaper, etc.

  10. Ashraf January 2, 2013 at 9:29 AM (comment permalink) -
    Mr. Boss

    @Janet: We stated in our review “Only clear and easy-to-read text was converted properly into an editable document, and even that text needs to be proofread, because the program still makes multiple mistakes”, which is what I assume you are hinting at. That said, it would be completely irresponsible to only test PDFs with images that have clear text, because that is very simple OCR and even a half assed OCR program can pull text from that. The real test is image PDFs with not-so-clear-text. We tested both.

  11. HJB January 2, 2013 at 9:55 AM (comment permalink) -

    @Ashraf:

    Janet is correct. Ashraf is fudging the issue and is not using terminology that is adequately precise.

    1. Font and text size are superficial to a digital text document, but can be critical in a digital graphic document that you want to OCR.

    2. If a paper document is poorly scanned or is “dirty”, i.e. marked up, actually dirty, full of lines (such as one produced from Excel) or set in a very non-standard font, the error rate is almost certainly going to go up, regardless of what OCR engine is used.

    3. The first test of any OCR engine should be as I have outlined. If it does not pass that test, then it is not worth pursuing how it deals with problem documents.

    4. Once you get to the problem document stage, the only fair way to do it is to take the exact same scan and put it through several OCR engines to see and compare what the results are. Looking only at the error output of only one OCR engine is close to meaningless.

    5. Acrobat has a relatively high end OCR engine. However, if you use the option to convert to a searchable graphic (the graphic looks exactly like the original, format and all, but the searchable text is not visible – it is mapped to the graphic) the result is that you may have a lot of errors that are not apparent — until you do a search, and find that a word that is clearly there does not show up in the search. That is why a Word-like correction process is sorely needed.

    6. An OCR engine that does a good job on the words, but produces poor or unformatted text, may suit your purposes or may be more trouble than it is worth.

  12. Ashraf January 2, 2013 at 10:29 AM (comment permalink) -
    Mr. Boss

    @HJB: It goes without saying that how well a document is OCR’ed depends heavily on the document itself, including many of the properties, such as clarity, that have been mentioned. I didn’t fudge anything; we are simplifying the process by referring to it as clear text or not-so-clear-text. It is impossible for us to test every document and, as I already mentioned, results will vary from document to document. We gave our opinion on this soft based on our unbiased testing. If you disagree with us, you are more than welcome to do so.

  13. Corno January 2, 2013 at 2:30 PM (comment permalink) -

    I think the verb “fudge” is a bit unluckily chosen, but HJB is of course correct when he says that a real test should consist of one and the same PDF being converted by various OCR programs and the results being compared. It would be nice to see the outcome.

  14. Ashraf January 2, 2013 at 2:33 PM (comment permalink) -
    Mr. Boss

    @Corno: Of course, I agree the same set of PDFs should be used when comparing OCR programs. The thing is, this isn’t a comparative review. It is a review of a single soft.
    I think we are all, in one way or another, saying or thinking essentially the same thing.

  15. Janet January 2, 2013 at 2:53 PM (comment permalink) -

    Since we just recently got/discussed Aiseesoft PDF to Word Converter, I would imagine some dottechers must have compared its conversion results with those of today’s PDF OCR…?
