OCR in opencv - how to pass objects - c++

I'd like to write an OCR in OpenCV. I need to recognize single letters, and I'd like to use k-Nearest Neighbors. The letters can come in different sizes and fonts, and may also be handwritten.
So I'll prepare images to train on. The first question is: should I use letters (1) in images of the same size, or (2) in images cropped to fit each letter?
What about the detected letters? Should I pass them as in (1) (resized to the same size as the training images) or as in (2) (just the bounding rectangle fitted to the letter)?

The "benchmark" MNIST dataset normalizes and centers the characters as in scenario (1) you described. If you're just interested in classification, it might make any difference how you do it.
If I understand you correctly, your second question has to do with what's called "preprocessing" in ML jargon. If you apply a transformation to convert each raw image into one of type (1) or (2), that is a preprocessing step, whichever one you choose. Whatever preprocessing you do to the training set, the exact same preprocessing has to be applied to the data before applying the model.
To make it simple: if you have a giant data set that you want to split into "training" and "testing" examples, first transform it into a "preprocessed" data set, and then split that one. That way you're sure the exact same transformation parameters are used for both training and testing.
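A minimal sketch of that idea with OpenCV's k-NN, assuming OpenCV 3+'s ml module; each letter is cropped to its bounding box and resized to a fixed 20x20 patch, and the same preprocessLetter() helper (a name chosen here for illustration) is applied to both the training letters and the detected letters:

#include <opencv2/opencv.hpp>
#include <opencv2/ml.hpp>
#include <vector>

// Crop to the letter's bounding box, then scale to one fixed size and flatten.
// The same function must be applied to training letters and detected letters.
cv::Mat preprocessLetter(const cv::Mat& gray)
{
    cv::Mat bin;
    cv::threshold(gray, bin, 0, 255, cv::THRESH_BINARY_INV | cv::THRESH_OTSU);

    std::vector<cv::Point> ink;
    cv::findNonZero(bin, ink);                               // letter pixels
    cv::Rect box = ink.empty() ? cv::Rect(0, 0, bin.cols, bin.rows)
                               : cv::boundingRect(ink);      // fit rectangle to letter

    cv::Mat fixed;
    cv::resize(bin(box), fixed, cv::Size(20, 20));           // normalize to one size
    cv::Mat row = fixed.reshape(1, 1);                       // 1 x 400 feature vector
    row.convertTo(row, CV_32F);
    return row;
}

// Train k-NN on preprocessed letters and classify one detected letter.
int classifyLetter(const std::vector<cv::Mat>& trainLetters,  // grayscale crops
                   const std::vector<int>& trainLabels,       // e.g. 0 = 'a', 1 = 'b', ...
                   const cv::Mat& detectedLetter)             // grayscale crop
{
    cv::Mat samples, labels;
    for (size_t i = 0; i < trainLetters.size(); ++i) {
        samples.push_back(preprocessLetter(trainLetters[i]));
        labels.push_back(static_cast<float>(trainLabels[i]));
    }

    cv::Ptr<cv::ml::KNearest> knn = cv::ml::KNearest::create();
    knn->train(samples, cv::ml::ROW_SAMPLE, labels);

    cv::Mat result;
    knn->findNearest(preprocessLetter(detectedLetter), 3, result);  // k = 3
    return static_cast<int>(result.at<float>(0, 0));
}

Whether you crop first and then resize (option 2) or keep the original framing and only resize (option 1) is your choice; the essential point is that whichever you pick is baked into this single function used on both sides.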

Choose best input to OCR from two images

I'm building a pre-processing stage to enhance the OCR results that will be produced in stage 2, starting from two images.
For example, I have image1 and image2, and I need to check which one is better to run the OCR on.
Performance and processing time are very important (real-time application).
Here are some cases I need to discuss:
Case1:
Both are "F" letter, but the first one is readable "F" in the OCR that will happen next, where the second one is not readable at all, so for case 1 I need to choose the first "F" as an input for the OCR and ignore the second image.
Case2:
Both are "R" letter, and both are readable in the OCR, but the first one is better from the second one as we see, so I need to choose the first "R" here.
Case3:
It's similar to the first case: the "n" here is not readable by the OCR, so I need to choose the first "na".
Case4:
In the first "na" the "n" and "a" are not merged together, where the second one they are one "contour" so the first "na" is much better to be the input to the OCR.
I need a generic, fast algorithm to check whether a given part of the image is a better input for the OCR or not.
I tried the following:
1- Method 1: Check whether the image is blurry and choose the sharper one (see the sketch after this list).
2- Method 2: Apply Canny (or Sobel) edge detection and choose the image that looks better.
3- Method 3: Count the contours in both images and choose the one that looks better according to contour areas and counts.
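For Method 1, a common sharpness measure is the variance of the Laplacian; a minimal sketch, where the function names and the idea of comparing just two crops are illustrative:

#include <opencv2/opencv.hpp>

// Variance of the Laplacian: higher value = stronger edges = likely sharper crop.
double sharpness(const cv::Mat& gray)
{
    cv::Mat lap;
    cv::Laplacian(gray, lap, CV_64F);
    cv::Scalar mean, stddev;
    cv::meanStdDev(lap, mean, stddev);
    return stddev[0] * stddev[0];
}

// Pick the crop that is more likely to be readable by the OCR stage.
cv::Mat chooseForOcr(const cv::Mat& candidate1, const cv::Mat& candidate2)
{
    return sharpness(candidate1) >= sharpness(candidate2) ? candidate1 : candidate2;
}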
Any better suggestions?
The question is how you get these chunks.
How do you decide that these are the things to compare?
If you know the sizes of the expected characters, so that you cut the chunks at that size, then you can use that in the comparison as well.
I think this question is a bit too broad. For different types of defects I would suggest different approaches.
For the first 2 cases and possibly the third, I would suggest doing a small morphological close and comparing the result with the original. The better letter would change less using close since it doesn't have small holes and dots. The comparison metric can be as simple as a sum over a pixelwise absolute difference.
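A minimal sketch of that comparison, assuming binary letter images; the 3x3 kernel size is an arbitrary starting point to tune:

#include <opencv2/opencv.hpp>

// Lower score = the letter changes less under closing = likely the cleaner candidate.
double closeChangeScore(const cv::Mat& binaryLetter)
{
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(3, 3));
    cv::Mat closed, diff;
    cv::morphologyEx(binaryLetter, closed, cv::MORPH_CLOSE, kernel);
    cv::absdiff(binaryLetter, closed, diff);
    return cv::sum(diff)[0];   // pixelwise absolute difference, summed
}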
You can try morphological operations. Have a look at this:
http://docs.opencv.org/master/d9/d61/tutorial_py_morphological_ops.html#gsc.tab=0

How to train an SVM for classifying images of the English alphabet?

My objective is to detect text in an image and recognize it.
I have managed to detect characters using the stroke width transform.
What should I do to recognize them?
As far as I know, I should train the SVM with my dataset of letter images in different fonts, by detecting feature points and extracting feature vectors from each image. (I have used SIFT feature vectors and built the dictionary using k-means clustering.)
Once I have detected a character, I will extract the SIFT feature vector for it, and I thought of feeding this into the SVM prediction function.
I don't know how to do the recognition with the SVM. I am confused! Please help me and correct me wherever I went wrong with the concept.
I followed this tutorial for the recognition part. Is this tutorial applicable to recognizing characters?
http://www.codeproject.com/Articles/619039/Bag-of-Features-Descriptor-on-SIFT-Features-with-O
SVM is a supervised classifier. To use it, you will need training data of the same type as the objects you are trying to recognize.
Step 1 - Prepare training data
The training data consists of pairs of feature vectors and their corresponding class labels. In your case, it appears that you have extracted a SIFT-based "Bag-of-Words" (BOW) feature vector for the characters you detected. So, for your training data, you will need to find many examples of the different characters, extract this feature vector for each of them, and associate each with a label (sometimes called a class label, and typically an integer), which you will later map to a textual description (e.g., the integer 0 could be mapped to the character 'a', and so on).
Step 2 - Training the classifier
The SVM classifier takes in as input an array/Mat of feature vectors (one per row) and their associated labels. Tune the parameters of the SVM (i.e., the regularization parameter C, and if applicable, any other parameters for kernels) on a separate validation set.
Step 3 - Predict for unseen data
At test time, given a sample that was not seen by the SVM during training, you compute a feature vector (your SIFT-based BOW vector) for the sample. Pass this feature vector to the SVM's predict function, and it will return you an integer. Remember how, when preparing your training data, you associated each character with an integer label? This is the label predicted by the SVM for this sample, and you can then map it back to a character. E.g., if you associated 0 with 'a', 1 with 'b', etc., you can use a vector/hashmap to map the integer to its textual counterpart.
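A minimal sketch of steps 2 and 3 with a linear SVM, assuming OpenCV 3+'s ml module (with OpenCV 2.x the class is CvSVM and the calls differ slightly); the function names and the label-to-character mapping are illustrative:

#include <opencv2/opencv.hpp>
#include <opencv2/ml.hpp>

// Step 2: train a linear SVM on BOW descriptors (one per row, CV_32F) and
// their integer class labels (N x 1, CV_32S). C is the one parameter to tune here.
cv::Ptr<cv::ml::SVM> trainCharacterSvm(const cv::Mat& bowDescriptors,
                                       const cv::Mat& classLabels)
{
    cv::Ptr<cv::ml::SVM> svm = cv::ml::SVM::create();
    svm->setType(cv::ml::SVM::C_SVC);
    svm->setKernel(cv::ml::SVM::LINEAR);
    svm->setC(1.0);
    svm->train(bowDescriptors, cv::ml::ROW_SAMPLE, classLabels);
    return svm;
}

// Step 3: predict the integer label for one unseen BOW descriptor (1 x dictionarySize,
// CV_32F) and map it back to a character (here 0 -> 'a', 1 -> 'b', ..., an assumed mapping).
char predictCharacter(const cv::Ptr<cv::ml::SVM>& svm, const cv::Mat& bowDescriptor)
{
    int label = static_cast<int>(svm->predict(bowDescriptor));
    return static_cast<char>('a' + label);
}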
Additional Notes
You can check out OpenCV's SVM tutorial here for details.
NOTE: Often, for beginners, the hardest part (after getting the data) is tuning the classifier. My advice is to first try a simple classifier with few parameters to tune; a decent choice is the linear SVM, which only requires you to adjust one parameter, C. Once you manage to get somewhat decent results (which gives some assurance that the rest of your code is working), you can move on to more "sophisticated" classifiers.
Lastly, the training data and the feature vectors you extract are very important. The training data must be "similar" to the test data you are trying to predict. For example, if you are predicting characters found in road signs, which come with different fonts, lighting conditions, and pose differences, then training data consisting of characters taken from, say, a newspaper/book archive may not give you good results. This is the issue of domain adaptation in machine learning.

Improve Tesseract detection quality

I am trying to extract alphanumeric characters (a-z, 0-9) which do not form meaningful words from an image taken with a consumer camera (including mobile phones). The characters have equal size and font type and are not formatted. The actual processing is done under Windows.
The following image shows the raw input:
After perspective processing I apply the following with OpenCV:
Convert from RGB to gray
Apply cv::medianBlur to remove noise
Convert the image to binary using adaptive thresholding (cv::adaptiveThreshold)
I know the number of rows and columns of the grid. Thus I simply extract each grid cell using this information.
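A minimal sketch of this preprocessing chain; the blur kernel, the adaptive-threshold block size and constant, and the equal-sized grid slicing are assumptions to adapt to your images:

#include <opencv2/opencv.hpp>
#include <vector>

// Gray -> median blur -> adaptive threshold -> cut the known grid into cells.
std::vector<cv::Mat> extractCells(const cv::Mat& bgr, int rows, int cols)
{
    cv::Mat gray, blurred, binary;
    cv::cvtColor(bgr, gray, cv::COLOR_BGR2GRAY);
    cv::medianBlur(gray, blurred, 3);
    cv::adaptiveThreshold(blurred, binary, 255,
                          cv::ADAPTIVE_THRESH_GAUSSIAN_C,
                          cv::THRESH_BINARY, 31, 10);

    // Slice the grid into equally sized cells, one character per cell.
    std::vector<cv::Mat> cells;
    int cellW = binary.cols / cols, cellH = binary.rows / rows;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            cells.push_back(binary(cv::Rect(c * cellW, r * cellH, cellW, cellH)).clone());
    return cells;
}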
After all these steps I get images which look similar to these:
Then I run tesseract (latest SVN version with latest training data) on each extracted cell image individually (I tried different -psm and -l values):
tesseract.exe -l eng -psm 11 sample.png outtext
The results produced by tesseract are not very good:
Most characters are not recognized.
The grid lines are sometimes interpreted as "l" or "i" characters.
I already experimented with morphologic operations (open, close, erode, dilate) and replaced adaptive thresholding with OTSU thresholding (THRESH_OTSU) but the results got worse.
What else could I try to improve the recognition quality? Or is there even a better method to extract the characters besides using tesseract (for instance template matching?)?
Edit (21-12-2014):
I tested simple template matching (using normalized cross-correlation and LMS), but with even worse results. However, I have made a huge step forward by extracting each character using findContours and then running Tesseract on a single character with the -psm 10 option, which interprets the input image as a single character. Additionally, I remove non-alphanumeric characters in a post-processing step. The first results are encouraging, with detection rates of 90% and better. The main problem is misdetections among the "9", "g" and "q" characters.
As I said here, you can tell Tesseract to pay attention to "almost identical" characters.
Also, there are some Tesseract options that won't help you in your example.
For instance, "Pocahonta5S" will most of the time become "PocahontaSS", because the digit sits inside a word made of letters.
Concerning pre-processing, you had better use a sharpening filter.
Don't forget that Tesseract always applies an Otsu threshold before reading anything.
If you want good results, sharpening plus adaptive thresholding, combined with some other filters, is a good idea.
I recommend using OpenCV in combination with Tesseract.
The problem with your input images for Tesseract is the non-character regions they contain.
My approach
To get rid of these, I would use the OpenCV findContours function to obtain all contours in your binary image. Afterwards, define some criteria to eliminate the non-character regions: for example, only take regions that lie inside the image and don't touch the border, or only take regions with a specific area or a specific height-to-width ratio. Find some kind of features that let you distinguish between character and non-character contours.
Afterwards, eliminate these non-character regions and hand the images over to Tesseract.
Just as an idea for testing this approach in general:
Eliminate the non-character regions manually (GIMP or Paint, ...) and give the image to Tesseract. If the result meets your expectations, you can try to eliminate the non-character regions with the method proposed above.
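A minimal sketch of the contour filtering described above; the area and aspect-ratio limits are placeholder values you would have to tune:

#include <opencv2/opencv.hpp>
#include <vector>

// Keep only contours whose bounding box looks like a character and does not
// touch the image border; everything else is blanked out before Tesseract.
cv::Mat keepCharacterRegions(const cv::Mat& binary)
{
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(binary.clone(), contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    cv::Mat cleaned = cv::Mat::zeros(binary.size(), binary.type());
    for (size_t i = 0; i < contours.size(); ++i) {
        cv::Rect box = cv::boundingRect(contours[i]);
        bool touchesBorder = box.x == 0 || box.y == 0 ||
                             box.x + box.width  >= binary.cols ||
                             box.y + box.height >= binary.rows;
        double aspect = static_cast<double>(box.height) / box.width;
        bool characterLike = box.area() > 50 && box.area() < 5000 &&
                             aspect > 0.5 && aspect < 5.0;
        if (!touchesBorder && characterLike)
            binary(box).copyTo(cleaned(box));   // keep this region
    }
    return cleaned;
}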
I suggest an approach similar to the one I'm using in my case.
(My only problem is speed, which you should not have if there are only a few characters to compare.)
First: get the form to a default size and transform it:
https://www.youtube.com/watch?v=W9oRTI6mLnU
Second: Use matchTemplate
Improve template matching with many templates for one Image/ find characters on image
I also played around with OCR, but I didn't like it for two reasons:
It is a kind of black box, and it is hard to debug why something is not recognized.
In my case it was never 100% accurate no matter what I did, even for screenshots with "perfect" characters.
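A minimal sketch of the matchTemplate idea, assuming the form has already been warped to a fixed size and that you have one grayscale template per character; the structure and the use of normalized correlation are illustrative:

#include <opencv2/opencv.hpp>
#include <vector>

struct TemplateHit { int templateIndex; double score; cv::Point location; };

// Slide each character template over the image and keep the best score.
TemplateHit bestTemplateMatch(const cv::Mat& image, const std::vector<cv::Mat>& templates)
{
    TemplateHit best{ -1, -1.0, cv::Point() };
    for (size_t i = 0; i < templates.size(); ++i) {
        cv::Mat result;
        cv::matchTemplate(image, templates[i], result, cv::TM_CCOEFF_NORMED);
        double maxVal; cv::Point maxLoc;
        cv::minMaxLoc(result, nullptr, &maxVal, nullptr, &maxLoc);
        if (maxVal > best.score)
            best = { static_cast<int>(i), maxVal, maxLoc };
    }
    return best;   // a score close to 1.0 means a confident match
}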

How to remove false matches from a FLANN-based matcher in OpenCV?

[Please read the question details before marking it as a duplicate or down-voting it. I have searched thoroughly and couldn't find a solution, hence I am posting the question here.]
I am trying to compare one image with multiple images and get a list of ALL matching images. I do NOT want to draw keypoints between images.
My solution is based on the following source code:
https://github.com/Itseez/opencv/blob/master/samples/cpp/matching_to_many_images.cpp
The above source code matches one image with multiple images and gets the best matching image.
I have modified the above sample and generated:
vector<vector<DMatch>> matches;
vector<vector<DMatch>> good_matches;
Now my question is: how do I apply the nearest-neighbor ratio test to get good matches across multiple images?
Edit 1:
My implementation is as follows:
For each image in the data-set, compute SURF descriptors.
Combine all the descriptors into one big matrix.
Build a FLANN index from the concatenated matrix.
Compute descriptors for the query image.
Run a KNN search over the FLANN index to find the top 20 (or fewer) best matching images. K is set to 20.
Filter out all the inadequate matches computed in the previous step. (How??)
I have successfully done steps 1 to 5. I am facing a problem in step 6, where I am not able to remove the false matches.
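For steps 1-5, note that OpenCV's FlannBasedMatcher (the class used in matching_to_many_images.cpp) can do the concatenation and indexing for you; a minimal sketch, assuming the SURF descriptors have already been computed:

#include <opencv2/opencv.hpp>
#include <vector>

// Build a FLANN index over all dataset descriptors and run a k-NN search (k = 20).
std::vector<std::vector<cv::DMatch>> matchAgainstDataset(
        const cv::Mat& queryDescriptors,
        const std::vector<cv::Mat>& datasetDescriptors)   // one Mat per image
{
    cv::FlannBasedMatcher matcher;
    matcher.add(datasetDescriptors);   // concatenation is handled internally
    matcher.train();                   // builds the FLANN index

    std::vector<std::vector<cv::DMatch>> matches;
    matcher.knnMatch(queryDescriptors, matches, 20);   // 20 nearest neighbors
    // matches[i][j].imgIdx tells you which dataset image neighbor j came from.
    return matches;
}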
There are two answers to your problem. The first is that you should be using a completely different technique; the second is how to actually do what you asked for.
Use a different method
You want to find duplicates of a given query image. Traditionally, you do this by comparing global image descriptors, not local feature descriptors.
The simplest way to do this would be to aggregate the local feature descriptors into a single global descriptor. The standard method here is "bag of visual words". In OpenCV this is called Bag-of-Words (see BOWTrainer, BOWImgDescriptorExtractor, etc.). Have a look at the documentation for using this.
There is some example code in samples/cpp/bagofwords_classification.cpp
The benefits will be that you get more robust results (depending on the implementation of what you are doing now), and that the matching is generally faster.
Use your method
I understand that you want to remove points from the input that lead to false positives in your matching.
You can't remove points from FLANN (1, 2, 3). FLANN builds a tree for fast search, and depending on the type of tree, removing a node can be impossible. As it happens, FLANN uses a KD-tree, which doesn't (easily) allow removal of points.
FlannBasedMatcher does not support masking permissible matches of descriptor sets because flann::Index does not support this.
I would suggest using a radius search instead of a plain search. Alternatively, look at the L2 distance of the found matches and write a function that checks whether the distance falls below a threshold.
Edit
I should also note that you can rebuild your FLANN tree. Obviously, there is a performance penalty in doing this, but if you have a large number of queries and some features come up as false positives far too often, it might make sense to do this once.
You need the functions DescriptorMatcher::clear() and then DescriptorMatcher::add(const vector<Mat>& descriptors) for this. Reference.
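A small sketch of that rebuild, assuming you have already decided which descriptor rows to drop:

#include <opencv2/opencv.hpp>
#include <vector>

// Clear the matcher, re-add the cleaned-up descriptor sets, and rebuild the index.
void rebuildMatcher(cv::FlannBasedMatcher& matcher,
                    const std::vector<cv::Mat>& filteredDescriptors)
{
    matcher.clear();                    // drop the old descriptors and index
    matcher.add(filteredDescriptors);   // add the descriptor sets you want to keep
    matcher.train();                    // rebuild the FLANN tree (the costly step)
}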
First, you need to define "inadequate matches"; without this definition you can't do anything.
There are several loose definitions that spring to mind:
1: inadequate matches = match with distance > pre-defined distance
In this case a FLANN radius search may be more appropriate, as it will only give you the indexes within the predefined radius of the target:
http://docs.opencv.org/trunk/modules/flann/doc/flann_fast_approximate_nearest_neighbor_search.html#flann-index-t-radiussearch
2: inadequate matches = match with distance > dynamically defined distance based on the retrieved k-nn
This is more tricky and off the top of my head i can think of two possible solutions:
2a: Define a ratio test based on the distance to the first 1-NN, such as:
base distance = distance to 1NN
inadequate match_k = match distance_k >= a * base distance;
2b: Use a dynamic thresholding technique, such as Otsu's threshold, on the normalized distribution of distances for the k-NN, thus partitioning the k-NN into two groups: the group that contains the 1-NN is the adequate group, the other is the inadequate group.
http://en.wikipedia.org/wiki/Otsu's_method,
http://docs.opencv.org/modules/imgproc/doc/miscellaneous_transformations.html#threshold.
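A minimal sketch of option 2a on the vector<vector<DMatch>> returned by knnMatch; the ratio a = 1.5 is an arbitrary starting value, and the inner vectors are assumed to be sorted by distance (which knnMatch guarantees):

#include <opencv2/opencv.hpp>
#include <vector>

// Keep a neighbor only if its distance is below a * (distance to the 1-NN).
std::vector<std::vector<cv::DMatch>> filterByRatio(
        const std::vector<std::vector<cv::DMatch>>& matches, float a = 1.5f)
{
    std::vector<std::vector<cv::DMatch>> good(matches.size());
    for (size_t i = 0; i < matches.size(); ++i) {
        if (matches[i].empty())
            continue;
        float baseDistance = matches[i][0].distance;    // distance to the 1-NN
        for (size_t k = 0; k < matches[i].size(); ++k)
            if (matches[i][k].distance < a * baseDistance)
                good[i].push_back(matches[i][k]);       // adequate match
    }
    return good;
}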

What are `query` and `train` in OpenCV features2D?

Everywhere in the features2D classes I see the terms query and train. For example, matches have trainIdx and queryIdx, and matchers have a train() method.
I know the definitions of the words train and query in English, but I can't understand the meaning of these properties and methods.
P.S. I understand that it's a very silly question, but maybe it's because English is not my native language.
To complete sansuiso's answer: I suppose the reason for choosing these names is that in some applications we have a set of images (training images) beforehand, for example 10 images taken inside your office. The features can be extracted and the feature descriptors computed for these images. At run-time, an image is given to the system to query the trained database; the query image refers to this image. I really don't like the way they have named these parameters. When you have a pair of stereo images and you want to match the features, these names don't make sense, and you have to choose a convention, say always call the left image the query image and the right image the training image. I did my PhD in computer vision, and some naming conventions in OpenCV seem really confusing/silly to me. So if you find these confusing or silly, you're not alone.
train: this function builds the classifier's inner state in order to make it operational. For example, think of training an SVM, or building a kd-tree from the reference data. Maybe you are confused because this step is often referred to as learning in the literature.
query is the action of finding the nearest neighbors of a set of points, and by extension it also refers to the whole set of points for which you want a nearest neighbor. Recall that you can ask for the neighbors of one point, or of a whole lot of points in the same function call (by stacking the feature points in a matrix).
trainIdx and queryIdx refer to the index of a point in the reference set and the query set respectively, i.e. you ask the matcher for the nearest point (stored at the trainIdx position) to some other point (stored at the queryIdx position). Of course, trainIdx is only known after the function call. If your points are stored in a matrix, the index will be the row of the considered feature.
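A small sketch of how these indices are used in practice, assuming keypoints and matches computed elsewhere; the helper name is made up for illustration:

#include <opencv2/opencv.hpp>
#include <utility>
#include <vector>

// For each match, pair up the query keypoint with its matched train keypoint.
std::vector<std::pair<cv::Point2f, cv::Point2f>> matchedPoints(
        const std::vector<cv::KeyPoint>& queryKeypoints,
        const std::vector<cv::KeyPoint>& trainKeypoints,
        const std::vector<cv::DMatch>& matches)
{
    std::vector<std::pair<cv::Point2f, cv::Point2f>> pairs;
    for (size_t i = 0; i < matches.size(); ++i) {
        const cv::DMatch& m = matches[i];
        // queryIdx indexes into the query set, trainIdx into the train (reference) set.
        pairs.emplace_back(queryKeypoints[m.queryIdx].pt,
                           trainKeypoints[m.trainIdx].pt);
    }
    return pairs;
}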
I understand "query" and "train" in a very naive but useful way:
"train": a data or image is preprocessed to get a database
"query": an input data or image that will be queried in the database which we trained before.
Hope it helps u as well.