[Windows] Convert scanned PDFs into text files with PDF OCR

As you all know, converting scanned documents into electronic files is not the easiest process in the world. There are a few programs out there that claim to do this, but most of them fall well short of that goal. PDF OCR promises to be different from the others. So let’s find out how it does!

WHAT IS IT AND WHAT DOES IT DO

Main Functionality

PDF OCR is based on OCR (Optical Character Recognition) technology. The idea is for this program to convert scanned PDF files (paper books, documents, etc.) into editable electronic text files. PDF OCR comes with a built-in text editor, which allows you to edit the OCR results without using MS Word. The program also supports batch mode to convert all pages of a PDF file to text at the same time. The program comes with a Scanned Image To PDF Converter as well. This means you can actually create your own scanned PDF books.

Perfect program for editing PDF files that were created using a Scan-to-PDF function that many scanners offer.

Pros

  • Allows you to extract text from PDF files — quickly converts text in scanned PDF documents into an editable text document
  • Can create PDFs out of image files (but the image -> PDF feature does not support OCR — you need to use PDF -> OCR feature after you create a PDF file if you want OCR)
  • Intuitive interface that is simple for almost everyone to figure out
  • Can be used as a standard PDF viewer
  • When converting PDF -> OCR, you can convert current page only, a range of pages, or whole PDF
  • Supports multiple languages: English, French, German, Fraktur, Italian, Dutch, Spanish, Portuguese, and Basque

Cons

  • Hit-or-miss conversion quality. It had difficulty extracting text from images, which is what the program promises to do. Only clear and easy-to-read text was converted properly into an editable document, and even that text needs to be proofread, because the program still makes multiple mistakes
  • Multi-language support seems to be limited to languages that use variations of the English characters (e.g. a, b, c, etc.) and does not include languages that use other characters, like Chinese or Arabic.
  • Page breaks do not seem to be recognized during conversion
  • Wants to install into C:\pdfOCR instead of a more proper C:\Program Files\pdfOCR location
  • Extremely resource hungry (at times it was using 80% of my computer’s resources during conversion). With conversion programs, high resource usage isn’t an issue because that just means they are doing their job, but I wanted to mention it anyway.
  • PDF -> OCR feature does not support drag + drop (image -> PDF feature does)
  • No ability to automatically shut down computer after conversion has finished, which would be nice to have seeing as some OCR can take a long time to complete
  • Doesn’t officially list Windows 8 as being supported, which is sad seeing how long Windows 8 has been out already. I don’t have Windows 8 so I didn’t test the program on it.
  • No offline help documentation

Discussion

There are a few different programs out there that claim to use OCR technology. However, if you have ever tried this technology, you will know that most of them don’t work very well, especially the cheap programs. The whole idea behind this technology is for the program to read non-editable text (whether from a scanned document or from a PDF file) and turn it into editable text.

And that really is what determines the quality of an OCR program: its ability to accurately detect and extract text from images. In the case of PDF OCR, that means its ability to accurately detect and extract text from PDFs that contain text within images (i.e. scanned PDFs). I found PDF OCR worked better than some other free OCR programs. However, in the end, I am still not sure that the program is worth the price tag.

Let’s start off with what it does right. It has an easy-to-understand interface, so everyone can use it. Also, you can use the program as a standard PDF viewer; however, most of us already have programs for that anyway, so this isn’t really a value-added feature, especially when you consider the PDF viewing aspect is below par when compared to dedicated PDF viewers (e.g. no browser plugin). The program will let you create editable text documents from scanned PDF files. If you are converting easy-to-read text, the program works most of the time. It is when you start working with images that things get a bit… odd.

I did four tests with this program. The first test used clear, machine text. I took a screenshot of a typed paragraph in English and turned that screenshot into a PDF file; since I used a screenshot, the text in the PDF was not native (despite being typed text) and could only be extracted using OCR. For my second and third tests, I repeated this with Japanese and German instead of English. Lastly, I tested PDF OCR’s ability to detect and extract handwritten (English) text: I wrote something by hand, scanned it, and threw it into PDF OCR.

The results? With a scanned PDF containing clear, machine English and German text, PDF OCR performed OK; it wasn’t perfect but it isn’t the worst I’ve seen either. You are going to need to proofread the document that it creates for you, because nine times out of ten, the program makes multiple mistakes. With a scanned PDF containing clear, machine Japanese text and a scanned PDF containing handwritten text, PDF OCR performed terribly… so much so that you would almost be better off manually typing the text yourself if you want it to be editable than using PDF OCR to do OCR. Of course, it should be pointed out the ability to extract handwritten text will vary from handwriting to handwriting but anyone that writes more human-like and less machine-like will have trouble with PDF OCR.
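For anyone who wants to put a number on "OK" versus "terrible", one common yardstick is the character error rate: the edit distance between the original text and the OCR output, divided by the length of the original. Below is a minimal Python sketch of that idea, assuming you still have the typed paragraph as plain text. The function names and sample strings are made up for illustration and have nothing to do with PDF OCR itself.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by reference length (0.0 means a perfect match)."""
    # Collapse whitespace so different line wrapping is not counted as an error.
    ref = " ".join(reference.split())
    out = " ".join(ocr_output.split())
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, out) / len(ref)


if __name__ == "__main__":
    # Hypothetical sample: what was typed vs. what an OCR tool returned.
    typed = "The quick brown fox"
    recognised = "The qu1ck brovvn fox"
    print(f"Character error rate: {character_error_rate(typed, recognised):.2%}")
```

A lower number is better, and running the same page through several OCR programs gives figures that can be compared directly.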

(Note: It appears that PDF OCR uses Tesseract as its engine for OCR. I looked in its Program Files folder and it has a “tessdata” folder with language files for the languages in the Pros list above. Tesseract is a free OCR engine maintained by Google and many freeware OCR programs use it. If PDF OCR uses it, and it looks like it does, I don’t see any reason why anyone would want to pay for PDF OCR when all they are getting is the same engine found in freeware OCR programs.)
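If you want to run the same check on an OCR program's folder yourself, a short script can list the language files it ships with. This is a rough sketch that assumes the files follow Tesseract's `<code>.traineddata` naming (for example `eng.traineddata`), which is what Tesseract 3 and later use; older layouts would not be picked up. The folder path in the example is hypothetical, not a confirmed PDF OCR location.

```python
from pathlib import Path


def list_tessdata_languages(tessdata_dir):
    """Return the sorted language codes found as <code>.traineddata files."""
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        # Missing folder: report no languages instead of raising.
        return []
    return sorted(p.stem for p in folder.glob("*.traineddata"))


if __name__ == "__main__":
    # Hypothetical path for illustration; point it at the real install folder.
    for code in list_tessdata_languages(r"C:\pdfOCR\tessdata"):
        print(code)
```

If the codes printed match the program's advertised language list, that is a reasonable hint (though not proof) that Tesseract is doing the work.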

CONCLUSION AND DOWNLOAD LINK

If this program were free, I might say it is worth the download. However, with a price tag of $39.95, I just can’t recommend this program to anyone. This is not the worst OCR program I have ever used, but it is far from the best, and I wouldn’t recommend that anyone drop $39.95 in return for the mediocre output quality this program offers.

Anyone looking for a free OCR solution will be hard pressed to find one, simply because good OCR is difficult to do and good OCR programs typically cost a lot of money. The best free OCR program I know of is gImageReader (it uses an open source OCR engine), but even gImageReader has its quirks. If anyone knows of good OCR programs (free or paid), do let us know in the comments below.

Price: Free to try, $39.95 to buy

Version reviewed: 4.3.1

Supported OS: Windows 2000 / XP / 2003 / Vista / 7

Download size: 13.8 MB

VirusTotal malware scan results: 1/46

Is it portable? No

PDF OCR homepage


Comments


31 comments

  1. Louis

    [@CJ Cotter] [@HJB] [@Louis] Thanks CJ Cotter, I received your test document, much obliged. I ran it through all of the above software (didn’t bother with Wondershare Free version), but as per HJB’s request, also ran it through Adobe Acrobat X, and I emailed the converted Word files back to you (all small files), you can have a look, and perhaps compare it to other results you’ve seen on that test doc on other OCR software ?

    Of the ones I tested it on, again Aiseesoft was the only one that gave meaningful, usable text while still keeping the original format of the document.

    @HJB: The Adobe Acrobat X output was not that great, more or less equal to the Wondershare Pro output (both messy, but with some usable text). Aiseesoft is streets ahead of both of them, and of the rest.

    I’m keeping that test document (and some others, as I’ve written about earlier) in an “OCR Testing” folder, together with the resulting output of the different OCR software I come across. I hope to find better quality as time goes on, but for now Aiseesoft is my “go-to” software for OCR conversion.

  2. CJ Cotter

    [@Ashraf] I DO have a “test” document that I got from my employer, which I use to try out all OCR software. It is a scanned PDF containing a combination of text and line-graphics. I’ve already sent it to several companies–including Wondershare–to help them test and improve their programs. The few companies that have replied back have told me that the text is “complex”, and therefore, my document cannot be converted. I would gladly send it to you for future OCR tests, if you were interested and told me where to send it.

    [@Mir?e] PDF-XChange Viewer Pro did not “convert” my scanned PDF document, but merely created a searchable layer on top, which I’m still unable to edit.

  3. Louis

    I’d like to share the experience I had last week with PDF to Word OCR software. Over time, I’ve collected quite a few free and commercial give-aways, to be used when urgently needed, as happened last week. Having to stand in for another lecturer who had been hospitalised, and to present an International Tax Law lecture on one day’s notice, I found myself late at night in desperate need of some text material I could put in a PPT for the next day. However, the University could not provide me with a book etc, so I tracked down a student who snapped some smartphone pics of the relevant chapter and emailed them to me later that night.

    The resulting output was of course in .jpg format, hence completely image-based, and although averagely readable, the quality onscreen really wasn’t good.

    So I had to find a way to somehow get those image-based words into text-based words, hence I converted one of these typical looking .jpg pages into a .PDF page using Adobe as a virtual printer. (I don’t have ABBYY Screen Reader).

    I have collected on my PC 7 different PDF to MS Word Converter software applications, and decided to first test all 7 on that one page, to see which would give me the best quality usable text.

    The 7 applications are :

    • PDF OCR Version 4
    • Wondershare PDF Converter Free
    • Wondershare PDF Converter Pro
    • Simpo PDF to Word Converter
    • UniPDF – PDF to Word Converter
    • AnyBizSoft PDF to Word Converter
    • Aiseesoft PDF to Word Converter

    My findings :

    Simpo PDF To Word Converter simply froze, and would not even close (granted, this may be a Windows OS problem that developed over time).

    PDF OCR Version 4, to be fair, is a PDF to Text converter. It tried, and did output characters that could be copied to a text file. Unfortunately it only succeeded in accurately converting the header of the page; the rest was gibberish.

    All of the rest, except two mentioned below, simply transferred the .jpg image containing the text, and essentially copied it as an image inside the Word document. There was no way to select any text, as it was just one big image inside the Word file. Utterly useless.

    Runner-up : Wondershare PDF Converter Pro

    Apparently better with its OCR engine than the freeware version, this one did in principle at least what it was supposed to do, which is to attempt to convert image characters into text characters and transfer them, with formatting, into a Word document. That it did, and all the output in the Word document was in text format, and could be selected and copied. However, it failed quite miserably in the quality of the conversion: only a few words could be identified, and the rest was gibberish. Worth keeping as a runner-up, for further in-depth testing may show other strengths.

    Winner : Aiseesoft PDF To Word Converter

    It did an almost flawless conversion of the words from the image: it missed none, and there were only 6 spelling errors, some of which were minor. The fonts etc were perfectly transferred. Considering that the text in the image was average quality at best, and worse in places, one can reasonably expect a solid conversion with few mistakes from a clear PDF source.

    Conclusion :

    Barring Simpo PDF To Word Converter (which froze and couldn’t work on my PC) and other possible applications not tested here, and bearing in mind that I did not compare it with PDF To Text Converters / Extractors (other than PDF OCR Version 4), nor with ABBYY, it will take some doing for any other software to beat Aiseesoft PDF To Word Converter.

    Granted this wasn’t a lab test, but it sure as heck was a pressure test ! Good to know which software will come through for you when you really need it, and which will fail you miserably.

    So if Aiseesoft PDF To Word Converter comes around again on one of the many give-away sites nowadays, I suggest you grab it and install it.

  4. Petr

    As for PDF to Word conversion (OCR or not), ABBYY PDF Transformer definitely does the best job (Transformer + is still better in most features, not in all). But, to get professional results, you must spend time on both pre-editing and post-editing and use the “formated text” mode. Then the converted document is really editable, i.e. even translatable, giving a simple and flexible outcome.

  5. Sys-Eng

    90% of the time, I need to convert a PDF form such as a job application etc. When doing those types of conversions, the rendering must be at least 99% accurate or it looks like child’s play. The biggest problem is often aligning the text rows because of different size fonts and bordered/shaded blank fields where the input is to be typed or written. It looks like a simple conversion but all of the free programs and free trials I have tried have failed to produce an acceptable document to be edited.

  6. Mir?e

    I am surprised that nobody suggested PDF-XChange Viewer, for me the best PDF viewer, which has supported OCR conversion since version 2.5 (I think) and has downloadable OCR dictionaries for quite a number of languages.
    I am quite satisfied with the results, given that it’s free and that even with top-line products like those from ABBYY you seldom get perfect results.

  7. Jeanjean

    @ ashraf
    I have one or two old versions obtained free of charge and that I use from time to time.
    Yes, the products are very expensive.
    I just pointed out the existence of the online service, but I’ve never used it.
    I think you just have to create an account.

  8. Ashraf
    Author/Mr. Boss

    @Corno: Of course, I agree the same set of PDFs should be used when comparing OCR programs. The thing is, this isn’t a comparative review. It is a review of a single soft.
    I think we are all, in one way or another, saying or thinking essentially the same thing.

  9. Corno

    I think the verb fudge is a bit unluckily chosen, but HJB is of course correct when he says that a real test should consist of one and the same PDF being converted by various OCR programs and the results being compared. It would be nice to see the outcome.

  10. Ashraf
    Author/Mr. Boss

    @HJB: It goes without saying that how well a document is OCR’ed depends heavily on the document itself, including many of the properties such as clarity that have been mentioned. I didn’t fudge anything; we are simplifying the process by referring to it as clear text or not-so-clear-text. It is impossible for us to test every document and, as I already mentioned, results will vary from document to document. We gave our opinion on this soft based on our unbiased testing. If you disagree with us, you are more than welcome to do so.

  11. HJB

    @Ashraf:

    Janet is correct. Ashraf is fudging the issue and is not using terminology that is adequately precise.

    1. Font and text size are superficial to a digital text document, but can be critical in a digital graphic document that you want to OCR.

    2. If a paper document is poorly scanned or is “dirty”, i.e. marked up, actually dirty or has a lot of lines (such as one produced from Excel) or uses a very non-standard font, the error rate is almost certainly going to go up, regardless of what OCR engine is used.

    3. The first test of any OCR engine should be as I have outlined. If it does not pass that test, then it is not worth pursuing how it deals with problem documents.

    4. Once you get to the problem document stage, the only fair way to do it is to take the exact same scan and put it through several OCR engines to see and compare what the results are. Looking only at the error output of only one OCR engine is close to meaningless.

    5. Acrobat has a relatively high end OCR engine. However, if you use the option to convert to a searchable graphic (the graphic looks exactly like the original, format and all, but the searchable text is not visible – it is mapped to the graphic) the result is that you may have a lot of errors that are not apparent — until you do a search, and find that a word that is clearly there does not show up in the search. That is why a Word-like correction process is sorely needed.

    6. An OCR engine that does a good job on the words, but produces poor or unformatted text, may suit your purposes or may be more trouble than it is worth.

  12. Ashraf
    Author/Mr. Boss

    @Janet: We stated in our review “Only clear and easy-to-read text was converted properly into an editable document, and even that text needs to be proofread, because the program still makes multiple mistakes”, which is what I assume you are hinting at. That said, it would be completely irresponsible to only test PDFs with images that have clear text, because that is very simple OCR and even a half assed OCR program can pull text from that. The real test is image PDFs with not-so-clear-text. We tested both.

  13. Janet

    @Ashraf:

    I think HJB’s point was to do the tests on a clear, non-problematic doc. An image file made from a text file is a good basis for comparison because OCR apps are designed to work on (at least) clear text in standard fonts, and different apps will have different problems with images made from e.g., hand printing, or graphics, or a scan of poor quality text such as a newspaper, etc.

  14. HJB

    @RobCr:

    When it comes to software, most people have four questions:

    1. How well does it work?
    2. What does it cost?
    3. How does it compare to other similar products with regard to 1 and 2.
    4. Is it worthwhile to me for things I want to do?

    A program that does not do the job that meets your needs is just a time waster for you even if it is free.

    As to OCR and pdfs, the subject is more complicated than most blog discussions indicate. For example, Adobe has several kinds of OCR that the user can opt to use. Each has its place and provides benefits that are appropriate to different needs. The major weaknesses are either that errors are hidden behind a searchable graphic image or that correcting an error within a pdf is anywhere from difficult to impossible.

    Someone ought to produce a pdf text editor that works something like MS Word’s correction processes.

  15. HJB

    I find Adobe Acrobat OCR works quite well. Even the best OCR engines are dependent upon the graphic quality of the material they are given.

    Generalizations based upon comparing apples and oranges have no validity and are misleading.

    A real test of OCR would follow these steps:

    1. Create a pdf from a text document, such as Word or plain text.

    2. Save the pdf pages as a graphic – using Acrobat or other reliable conversion application.

    3. Feed the graphic to the OCR engine.

    4. Compare 1 to the result of 3, either by visual examination or by feeding it to MS Word’s version comparison.
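Step 4 of the list above can also be scripted instead of done by eye. Here is a minimal Python sketch, assuming the original text from step 1 and the OCR result from step 3 are both available as plain strings. `difflib` is part of Python's standard library; the function name and the sample strings are hypothetical.

```python
import difflib


def word_differences(reference, ocr_output):
    """List (operation, original words, OCR words) for every mismatch.

    operation is 'replace', 'delete' (words missing from the OCR output),
    or 'insert' (extra words in the OCR output).
    """
    ref_words = reference.split()
    out_words = ocr_output.split()
    # autojunk=False keeps difflib from ignoring very frequent words in long texts.
    matcher = difflib.SequenceMatcher(None, ref_words, out_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (tag, " ".join(ref_words[i1:i2]), " ".join(out_words[j1:j2]))
            )
    return differences


if __name__ == "__main__":
    # Hypothetical sample: source text vs. what an OCR engine produced.
    original = "the quick brown fox jumps over the lazy dog"
    scanned = "the qu1ck brown fox jumps over lazy dog"
    for operation, was, became in word_differences(original, scanned):
        print(f"{operation}: {was!r} -> {became!r}")
```

Feeding the same graphic to several OCR engines and comparing the length of each list gives a rough like-for-like ranking.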

  16. Corno

    It is only recently that PDF converters with OCR facilities are being given away. Recently Aiseesoft PDF2Word was given away at GOTD, and truth be told it produced conversions that were outstanding (better in fact than the OCR conversions done by the paid Wondershare Pro PDF Converter). What bothers me is that there is a shady side to Aiseesoft and some other obscure vendors that are flooding the market with PDF converters these days. Remember, PDF is a weak spot in your PC’s line of defense and the prospect of becoming part of a totalitarian state’s botnet does not exactly appeal to me.