Does tesseract-ocr perform any image pre-processing? - c++

I am currently working with the Tesseract OCR engine and I'm using it in conjunction with OpenCV for pre-processing the image before sending it to the OCR engine. However, I was wondering if Tesseract itself is performing some image pre-processing before extracting the text. If so, what are the methods that Tesseract implements?
My objective is to ensure I don't perform redundant pre-processing methods. Some of the pre-processing methods I perform are adaptiveThreshold and GaussianBlur.
Any help/guidance would be much appreciated!
EDIT:
I understand Tesseract does basic image pre-processing. I would like to know if it is possible to bypass these steps and directly feed in an image that I processed manually (again, in order to avoid redundant processing of the image).

Tesseract uses the Leptonica library for various pre-processing operations such as Otsu binarization, dilation, erosion and so on.
But because these operations are not tuned to your data, they can produce bad results in some cases.
For more information read this page.
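Since Tesseract applies Otsu binarization internally (via Leptonica), one way to avoid redundant work is to binarize the image yourself and hand Tesseract an already-binary image. Here is a minimal, self-contained sketch of Otsu's method on an 8-bit grayscale buffer, plain C++ with no OpenCV, for illustration only (OpenCV's `threshold` with `THRESH_OTSU` does the same thing):

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Compute the Otsu threshold of an 8-bit grayscale image by maximizing
// the between-class variance over all 256 candidate thresholds.
int otsuThreshold(const std::vector<std::uint8_t>& pixels) {
    std::array<int, 256> hist{};
    for (auto p : pixels) ++hist[p];

    const double total = static_cast<double>(pixels.size());
    double sumAll = 0.0;
    for (int t = 0; t < 256; ++t) sumAll += t * static_cast<double>(hist[t]);

    double sumBg = 0.0, weightBg = 0.0, bestVar = -1.0;
    int bestT = 0;
    for (int t = 0; t < 256; ++t) {
        weightBg += hist[t];                  // background = values <= t
        if (weightBg == 0) continue;
        const double weightFg = total - weightBg;
        if (weightFg == 0) break;
        sumBg += t * static_cast<double>(hist[t]);
        const double meanBg = sumBg / weightBg;
        const double meanFg = (sumAll - sumBg) / weightFg;
        const double betweenVar =
            weightBg * weightFg * (meanBg - meanFg) * (meanBg - meanFg);
        if (betweenVar > bestVar) { bestVar = betweenVar; bestT = t; }
    }
    return bestT;
}
```

Pixels at or below the returned threshold become background (or foreground, depending on polarity); feeding Tesseract the resulting black-and-white image makes its own binarization a no-op in practice.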

Related

Stitching images can't detect common feature points

I wish to stitch two or more images using OpenCV and C++. The images have regions of overlap but they are not being detected. I tried using homography detector. Can someone please suggest as to what other methods I should use. Also, I wish to use the ORB algorithm, and not SIFT or SURF.
The images can be found at-
https://drive.google.com/open?id=133Nbo46bgwt7Q4IT2RDuPVR67TX9xG6F
This is a very common problem. Images like these actually do not have much in common, and the overlap region is not rich in features. What you can do is dig into the OpenCV stitcher code: it uses a confidence factor for feature matching, and you can play with that confidence factor to get matches in this case. But this will only work if your feature detector is able to detect some features in the overlapping region.
You can also look at this post:
Related Question
It might be helpful for you.
"OpenCV stitching code"
This is the full pipeline of the OpenCV stitching code. You can see that there are a lot of parameters you can change to make your code give a good stitching result. I would also suggest using a small image (640×480) for the feature detection step; small images work better here than very large ones.
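For context, the confidence factor works roughly like this: OpenCV's pairwise matcher (in `cv::detail::BestOf2NearestMatcher`, to the best of my recollection, so treat the exact constants as an assumption to verify against your OpenCV version) scores each image pair as `num_inliers / (8 + 0.3 * num_matches)` and drops pairs below the panorama confidence threshold. A tiny sketch of that check:

```cpp
// Hedged sketch: reproduces the pairwise-match confidence formula believed
// to be used by OpenCV's stitching module (constants 8 and 0.3 come from
// its matcher code; verify against your OpenCV version before relying on them).
double matchConfidence(int numInliers, int numMatches) {
    return numInliers / (8.0 + 0.3 * numMatches);
}

// An image pair is kept for stitching only if its confidence exceeds the
// threshold (default around 1.0; lower it when the overlap is feature-poor).
bool keepPair(int numInliers, int numMatches, double confThresh) {
    return matchConfidence(numInliers, numMatches) > confThresh;
}
```

So with 100 ORB matches and only 10 RANSAC inliers the pair is rejected at the default threshold, which is exactly why lowering the confidence threshold can rescue stitches with weak overlap.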

Sparse coding and dictionary learning using opencv and c++

I am trying to perform text image restoration and I can find no proper documentation on how to perform OMP or K-SVD in C++ using opencv.
I have over 1000 training images of different sizes so do I divide images into equal sized patches or resize all images? How do I construct the signal matrix X?
What other pre-processing steps are required for sparse coding? How to actually perform K-SVD on color images?
What data type is available in OpenCV for an image dictionary and how do I initialize the Dictionary D?
I have these very basic questions and have tried various libraries, but they don't make the workflow very clear.
I found this code useful. It is the only implementation in OpenCV I have come across so far. I guess it uses a single image for dictionary learning, whereas I have to use at least 1000 images, but it certainly provides a good guideline.
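On the "how do I construct X" question: the usual approach is to extract many fixed-size patches (e.g. 8×8) from the training images, vectorize each patch, and stack the vectors as columns of the signal matrix X; that sidesteps the differing image sizes entirely. A plain-C++ sketch (X stored as a vector of columns; the patch size and stride are illustrative choices, not requirements):

```cpp
#include <cstdint>
#include <vector>

// Extract all (patch x patch) patches at the given stride from a grayscale
// image stored row-major, and return them as columns of the signal matrix X.
// Each column has patch*patch entries (the vectorized patch).
std::vector<std::vector<double>> buildSignalMatrix(
        const std::vector<std::uint8_t>& img, int width, int height,
        int patch, int stride) {
    std::vector<std::vector<double>> X;
    for (int y = 0; y + patch <= height; y += stride) {
        for (int x = 0; x + patch <= width; x += stride) {
            std::vector<double> col;
            col.reserve(patch * patch);
            for (int dy = 0; dy < patch; ++dy)
                for (int dx = 0; dx < patch; ++dx)
                    col.push_back(img[(y + dy) * width + (x + dx)]);
            X.push_back(std::move(col));
        }
    }
    return X;  // K-SVD then learns D so that X is approximated by D * sparse codes
}
```

Running this over all 1000 images and concatenating the columns gives one large X; the dictionary D is typically initialized with randomly chosen (normalized) columns of X before the K-SVD iterations. For color images, the common choices are to learn on the luminance channel or on each channel separately.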

Restrict Preprocessing of Tesseract

I am new to tesseract library and I set it up on Ubuntu 12.04.
I am using this data set for recognition. When I fed these images to Tesseract as they were (without any preprocessing) using this code, I was getting approximately 70-75% accuracy.
I want accuracy to be 90+%, so I did some preprocessing. The steps I followed to enhance the images are:
Steps for Preprocessing
Applied a bottom-hat operator with a disk-shaped structuring element of radius 12
Complemented the image to make the background white and the text black
Enhanced the contrast of the resultant image
Eroded the image.
After these steps I get pretty clear images, which can be seen here. But now, when I feed these images to Tesseract using that same code, accuracy drops below 50%, and I don't know why. Is it because Tesseract does some preprocessing as well? If yes, how can I stop Tesseract from doing that preprocessing? If not, why is it giving me bad results when the image is pretty clear now? Pardon me if I have asked a basic question.
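For reference, the erosion step can be sketched in plain C++ like this (a 3×3 square structuring element here, as an illustrative stand-in; the radius-12 disk from the bottom-hat step would be built analogously with a larger neighbourhood test):

```cpp
#include <vector>

// Binary erosion with a 3x3 square structuring element: an output pixel is
// foreground (1) only if every pixel in its 3x3 neighbourhood is foreground.
// Border pixels are left as background for simplicity.
std::vector<int> erode3x3(const std::vector<int>& img, int w, int h) {
    std::vector<int> out(img.size(), 0);
    for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
            int keep = 1;
            for (int dy = -1; dy <= 1 && keep; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    if (img[(y + dy) * w + (x + dx)] == 0) { keep = 0; break; }
            out[y * w + x] = keep;
        }
    }
    return out;
}
```

Note that erosion thins strokes; if the characters are already thin, an aggressive erosion can be exactly what hurts recognition accuracy.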
Regarding your question why Tesseract delivers better results when given a binary image instead of a grayscale image:
Tesseract performs an internal binarization of grayscale input using various methods (I haven't figured out exactly which method is used; sometimes a local adaptive threshold is mentioned on the internet, sometimes a global Otsu threshold). What is certain is that Tesseract performs character recognition on a binary image, and that its preprocessing can still fail on specific problems (it doesn't have good layout analysis, for example). So if you do the preprocessing yourself, give Tesseract only a binary image with text as input, and disable all layout analysis in Tesseract, you can achieve better results than letting Tesseract do everything for you. Since it is a free, open-source utility, it has some known drawbacks that have to be accepted.
If you use Tesseract as a command-line tool, this thread is very useful for the parameters: tesseract command line page segmentation.
If you use the Tesseract source code to develop your own C++ code, you have to initialize Tesseract with the correct parameters. The parameters are described on the Tesseract API site: tesseract API
Well, I was feeding a grayscale (8 bpp) image to Tesseract after preprocessing, so Tesseract was trying to binarize that grayscale image itself, i.e. convert it to black and white, and that was giving me bad results; I still don't know why.
But after I first converted my grayscale image into a b/w (1 bpp) image and then fed that image to Tesseract, I got much better results.

Feature Extraction of a binary image

I am doing an OCR project using C++ and OpenCV. I have some black-and-white images of separated handwritten characters. I want to extract unique features from those images in order to classify them using LIBSVM. Can anyone tell me which algorithms are suitable for feature extraction in OpenCV?
You can read this. And try this.
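One simple and effective baseline feature for separated binary characters, before reaching for fancier OpenCV descriptors like HOG, is zoning: split the character image into an n×n grid and use the foreground-pixel density of each zone as the feature vector for LIBSVM. A self-contained sketch (the grid size is an illustrative choice):

```cpp
#include <vector>

// Zoning features: divide a binary image (0/1 values, row-major) into
// grid x grid zones and return the fraction of foreground pixels per zone.
std::vector<double> zoningFeatures(const std::vector<int>& img,
                                   int w, int h, int grid) {
    std::vector<double> feat(grid * grid, 0.0);
    std::vector<int> counts(grid * grid, 0);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            int zx = x * grid / w, zy = y * grid / h;   // zone of this pixel
            feat[zy * grid + zx] += img[y * w + x];
            ++counts[zy * grid + zx];
        }
    }
    for (int i = 0; i < grid * grid; ++i)
        if (counts[i] > 0) feat[i] /= counts[i];        // normalize to [0,1]
    return feat;  // length grid*grid feature vector for the classifier
}
```

Because the densities are already normalized to [0, 1], the vector can go into LIBSVM directly; a 4×4 grid (16 features) is a common starting point for isolated handwritten characters.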

C++ Library for image recognition: images containing words to string

Does anyone know of a C++ library for taking an image and performing image recognition on it such that it can find letters based on a given font and/or font height? Even one that doesn't let you select a font would be nice (e.g. readLetters(Image image)).
I've been looking into this a lot lately. Your best bet is simply Tesseract. If you need layout analysis on top of the OCR, then go with Ocropus (which in turn uses Tesseract to do the OCR). Layout analysis refers to being able to detect the position of text on the image and do things like line segmentation, block segmentation, etc.
I've found some really good tips through experimentation with Tesseract that are worth sharing. Basically I had to do a lot of preprocessing for the image.
Upsize/Downsize your input image to 300 dpi.
Remove color from the image. Grayscale is good. I actually used a dither threshold and made my input black and white.
Cut out unnecessary junk from your image.
For all three above I used netpbm (a set of image manipulation tools for Unix) to get to the point where I was getting pretty much 100 percent accuracy for what I needed.
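The "remove color" step above can be sketched in plain C++ using the standard Rec. 601 luma weights (0.299/0.587/0.114, the same coefficients most toolkits use for RGB-to-gray conversion):

```cpp
#include <cstdint>
#include <vector>

// Convert interleaved 8-bit RGB pixels to grayscale using Rec. 601 luma
// weights; the output has one byte per pixel.
std::vector<std::uint8_t> toGrayscale(const std::vector<std::uint8_t>& rgb) {
    std::vector<std::uint8_t> gray;
    gray.reserve(rgb.size() / 3);
    for (std::size_t i = 0; i + 2 < rgb.size(); i += 3) {
        double y = 0.299 * rgb[i] + 0.587 * rgb[i + 1] + 0.114 * rgb[i + 2];
        gray.push_back(static_cast<std::uint8_t>(y + 0.5));  // round to nearest
    }
    return gray;
}
```

A subsequent dither or plain threshold on this grayscale buffer then produces the black-and-white input that Tesseract handles best.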
If you have a highly customized font and go with tesseract alone you have to "Train" the system -- basically you have to feed a bunch of training data. This is well documented on the tesseract-ocr site. You essentially create a new "language" for your font and pass it in with the -l parameter.
The other training mechanism I found was with Ocropus, using neural net (bpnet) training. It requires a lot of input data to build a good statistical model.
In terms of invoking them, Tesseract and Ocropus are both C++. It won't be as simple as ReadLines(Image), but there is an API you can check out. You can also invoke them via the command line.
While I cannot recommend one in particular, the term you are looking for is OCR (Optical Character Recognition).
There is tesseract-ocr which is a professional library to do this.
From their web site:
The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available
I think what you want is Conjecture. It used to be the libgocr project. I haven't used it for a few years, but it used to be very reliable if you set up a key.
The Tesseract OCR library gives pretty accurate results; it's a C and C++ library.
My initial results were around 80% accurate, but after applying pre-processing to the images before supplying them for OCR, the results were around 95% accurate.
What is pre-processing:
1) Binarize the bitmap (B&W worked better for me). How it could be done
2) Resampling your image to 300 dpi
3) Save your image in a lossless format, such as LZW TIFF or CCITT Group 4 TIFF.
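The resampling step (2) can be sketched with nearest-neighbour interpolation in plain C++ (bilinear or bicubic usually gives nicer OCR input; nearest-neighbour is used here only to keep the sketch short):

```cpp
#include <cstdint>
#include <vector>

// Nearest-neighbour resampling of a row-major 8-bit grayscale image
// from (srcW x srcH) to (dstW x dstH).
std::vector<std::uint8_t> resizeNearest(const std::vector<std::uint8_t>& src,
                                        int srcW, int srcH,
                                        int dstW, int dstH) {
    std::vector<std::uint8_t> dst(static_cast<std::size_t>(dstW) * dstH);
    for (int y = 0; y < dstH; ++y) {
        int sy = y * srcH / dstH;                 // nearest source row
        for (int x = 0; x < dstW; ++x) {
            int sx = x * srcW / dstW;             // nearest source column
            dst[y * dstW + x] = src[sy * srcW + sx];
        }
    }
    return dst;
}
```

In practice, "300 dpi" just means scaling so that the text ends up with capital letters roughly 30-40 pixels tall; the dpi tag in the saved TIFF tells Tesseract what scale to assume.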