- dotTech - http://dottech.org -
[Windows] Convert scanned PDFs into text files with PDF OCR
Posted By Ashraf On November 22, 2013 @ 12:00 AM In Windows
As you all know, converting scanned documents into electronic files is not the easiest process in the world. There are a few programs out there that claim to do this, but most of them fall well short of completing this goal. PDF OCR promises to be different from the others. So let’s find out how it does!
PDF OCR is based on OCR (Optical Character Recognition) technology. The idea is for this program to convert scanned PDF files (paper books, documents, etc.) into editable electronic text files. PDF OCR comes with a built-in text editor, which allows you to edit the OCR results without using MS Word. The program also supports a batch mode that converts all pages of a PDF file to text in one go. It comes with a Scanned Image To PDF Converter as well, which means you can create your own scanned PDF books.
Perfect program for editing PDF files that were created using a Scan-to-PDF function that many scanners offer.
There are a few different programs out there that claim to use OCR technology. However, if you have ever tried this technology, you will know that most of them don’t work very well — especially the cheap programs. The whole idea behind this technology is for the program to read non-editable text (whether from a scanned document or from a PDF file).
And that really is what determines the quality of an OCR program: its ability to accurately detect and extract text from images. In the case of PDF OCR, that means its ability to accurately detect and extract text from PDFs that contain text within images (i.e. scanned PDFs). I found PDF OCR worked better than some other free OCR programs. However, in the end, I am still not sure that the program is worth the price tag.
Let’s start off with what it does right. It has an easy-to-understand interface, so everyone can use it. You can also use the program as a standard PDF viewer; however, most of us already have programs for that, so this isn’t really a value-added feature, especially when you consider that the PDF viewing aspect is below par compared to dedicated PDF viewers (e.g. no browser plugin). The program will let you create editable text documents from scanned PDF files. If you are converting easy-to-read text, the program works most of the time. It is when you start working with images that things get a bit… odd.
I did four tests with this program. The first test used clear, machine text. I took a screenshot of a typed paragraph in English and turned that screenshot into a PDF file; since I used a screenshot, the text in the PDF was not native, despite being typed text, and could only be extracted using OCR. For my second and third tests, I repeated this test with Japanese and German text instead of English. Lastly, I tested PDF OCR’s ability to detect and extract handwritten (English) text: I wrote something by hand, scanned it, and threw it into PDF OCR.
The results? With a scanned PDF containing clear, machine English and German text, PDF OCR performed OK; it wasn’t perfect, but it wasn’t the worst I’ve seen either. You are going to need to proofread the document that it creates for you because, nine times out of ten, the program makes multiple mistakes. With a scanned PDF containing clear, machine Japanese text and a scanned PDF containing handwritten text, PDF OCR performed terribly… so much so that you would almost be better off typing the text yourself than running it through PDF OCR. Of course, the ability to extract handwritten text will vary from handwriting to handwriting, but anyone whose writing is more human-like and less machine-like will have trouble with PDF OCR.
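For anyone who wants to put a number on “OK” versus “terribly”, one common approach is the character error rate: the edit distance between the text you typed and the text the OCR program produced, divided by the length of the original. The following is only a rough sketch and not how the tests above were scored (those were judged by eye); the function names and the sample strings are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the reference text.
    0.0 means a perfect match; higher is worse."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    # Hypothetical strings, not actual PDF OCR output.
    typed = "The quick brown fox"
    recognized = "The qu1ck brovvn fox"
    print(f"Character error rate: {character_error_rate(typed, recognized):.1%}")
```

A lower percentage means less proofreading. The measure works on characters rather than words, so it can be applied to Japanese text in the same way as to English or German.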
(Note: It appears that PDF OCR uses Tesseract as its engine for OCR. I looked in its Program Files folder and it has a “tessdata” folder with language files for the languages in the Pros list above. Tesseract is a free OCR engine maintained by Google, and many freeware OCR programs use it. If PDF OCR uses it, and it looks like it does, I don’t see any reason why anyone would want to pay for PDF OCR when all they are getting is the same engine found in freeware OCR programs.)
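If you want to repeat that check on your own machine, a small script can list which language files sit in a “tessdata” folder. This is a sketch under assumptions: it assumes the folder holds Tesseract 3.x-style files named `<language code>.traineddata` (for example `eng.traineddata`), and the install path in the example is a guess, not something confirmed for PDF OCR.

```python
from pathlib import Path


def installed_tesseract_languages(tessdata_dir):
    """Return the sorted language codes found in a tessdata folder.

    Assumes Tesseract 3.x-style naming, where each language ships as
    '<code>.traineddata'. Returns an empty list if the folder is missing.
    """
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []
    return sorted(p.stem for p in folder.glob("*.traineddata"))


if __name__ == "__main__":
    # Hypothetical install location; adjust to wherever the program lives.
    guess = r"C:\Program Files\PDF OCR\tessdata"
    print(installed_tesseract_languages(guess))
```

If the codes printed here match the languages a program advertises, that is a good hint (though not proof) that Tesseract is doing the work under the hood.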
If this program were free, I might say it is worth the download. However, with a price tag of $39.95, I just can’t recommend this program to anyone. This is not the worst OCR program I have ever used, but it is far from the best, and I wouldn’t recommend that anyone drop $39.95 in return for the mediocre output quality this program offers.
Anyone looking for a free OCR solution will be hard pressed to find one, simply because good OCR is difficult to do and good OCR programs typically cost a lot of money. The best free OCR program I know of is gImageReader, which uses an open source OCR engine, but even gImageReader has its quirks. If anyone knows of good OCR programs (free or paid), do let us know in the comments below.
Price: Free to try, $39.95 to buy
Version reviewed: 4.3.1
Supported OS: Windows 2000 / XP / 2003 / Vista / 7
Download size: 13.8 MB
VirusTotal malware scan results: 1/46 
Is it portable? No
PDF OCR homepage 
URL to article: http://dottech.org/91691/windows-review-pdf-ocr/
URLs in this post:
 gImageReader: http://dottech.org/21372/gimagereader-open-source-google-powered-ocr-optical-character-recognition-program-that-actually-works/
 1/46: https://www.virustotal.com/file/16de9d37b85757e0747779ea972ad06657b698395e6a17441c49daa65ec27134/analysis/
 PDF OCR homepage: http://www.pdfocr.net/