Measure of accuracy in pattern recognition using SURF in OpenCV - c++

I'm currently working on pattern recognition using SURF in OpenCV. What I have so far: I've written a program in C# where I can select a source image and a template that I want to find. After that I transfer both pictures into a C++ DLL where I've implemented a program using the OpenCV SURF detector, which returns all the keypoints and matches back to my C# program, where I try to draw a rectangle around my matches.
Now my question: Is there a common measure of accuracy in pattern recognition? For example, the number of matches in proportion to the number of keypoints in the template? Or maybe the size difference between my match rectangle and the original size of the template image? What are common parameters used to say whether a match is a "real" and "good" match?
Edit: To make my question clearer: I have a bunch of match points that are already thresholded by the minHessian and the distance value. After that I draw something like a rectangle around my match points, as you can see in my picture. This is my MATCH. How can I tell now how good this match is? I'm already calculating angle, size and color differences between the match I found and my template, but I think that is much too vague.

I am not 100% sure about what you are really asking, because what you call a "match" is vague. But since you said you already matched your SURF points and mentioned pattern recognition and the use of a template, I am assuming that, ultimately, you want to localize the template in your image and you are asking about a localization score to decide whether you found the template in the image or not.
This is a challenging problem and I am not aware that a good and always-appropriate solution has been found yet.
However, given your approach, what you could do is analyze the density of matched points in your image: consider local or global maxima as possible locations for your template (global if you know your template appears only once in the image, local if it can appear multiple times) and use a threshold on the density to decide whether or not the template appears. A sketch of the algorithm could be something like this:
Allocate a floating-point density map the size of your image
Compute the density map by increasing it by a fixed amount in the neighborhood of each matched point (for instance, for each matched point, add a fixed value epsilon in the rectangle you are displaying in your question)
Find the global or local maxima of the density map (the global maximum can be found using the OpenCV function minMaxLoc, and local maxima can be found using morphological operations, e.g. How can I find local maxima in an image in MATLAB?)
For each maximum obtained, compare the corresponding density value to a threshold tau, to decide whether your template is there or not
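Below is a minimal C++ sketch of this density-map idea. It assumes the SURF matches and the scene keypoints are already available, that the template descriptors were the query set (so trainIdx indexes the scene keypoints), and the function name, the neighborhood size and the thresholds are illustrative placeholders, not part of the original answer.

#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <vector>

// Returns the location of the global density maximum; *found tells whether it passes tau.
cv::Point bestTemplateLocation(const std::vector<cv::DMatch>& matches,
                               const std::vector<cv::KeyPoint>& sceneKeypoints,
                               const cv::Size& imageSize,
                               int radius, float epsilon, float tau, bool* found)
{
    // 1. Allocate a floating point density map of the size of the image
    cv::Mat density = cv::Mat::zeros(imageSize, CV_32FC1);

    // 2. Add epsilon in a (2*radius+1)^2 neighborhood of each matched point
    for (size_t i = 0; i < matches.size(); ++i)
    {
        const cv::Point2f& p = sceneKeypoints[matches[i].trainIdx].pt;
        cv::Rect roi(cvRound(p.x) - radius, cvRound(p.y) - radius,
                     2 * radius + 1, 2 * radius + 1);
        roi &= cv::Rect(0, 0, imageSize.width, imageSize.height);  // clip to the image
        cv::Mat patch = density(roi);
        patch += epsilon;
    }

    // 3. Global maximum of the density map (local maxima would need the morphological trick)
    double maxVal;
    cv::Point maxLoc;
    cv::minMaxLoc(density, NULL, &maxVal, NULL, &maxLoc);

    // 4. Compare the density value to the threshold tau
    *found = (maxVal >= tau);
    return maxLoc;
}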
If you are into research articles, you can check the following ones for improvements of this basic algorithm:
"ADABOOST WITH KEYPOINT PRESENCE FEATURES FOR REAL-TIME VEHICLE VISUAL DETECTION", by T. Bdiri, F. Moutarde, N. Bourdis and B. Steux, 2009.
"Interleaving Object Categorization and Segmentation", by B. Leibe and B. Schiele, 2006.
EDIT: another way to address your problem is to try to remove accidentally matched points in order to keep only those truly corresponding to your template image. This can be done by enforcing a constraint of consistency between close matched points. The following research article presents such an approach: "Context-dependent logo matching and retrieval", by H. Sahbi, L. Ballan, G. Serra, A. Del Bimbo, 2010 (however, this may require some background knowledge...).
Hope this helps.

Well, when you compare points you use some metric, so the result of a comparison is a distance, and the smaller this distance, the better the match.
Example of code:
BFMatcher matcher(NORM_L2, true);   // L2 norm suits SURF/SIFT descriptors; cross-check enabled
vector<DMatch> matches;
matcher.match(descriptors1, descriptors2, matches);
// std::remove_if requires <algorithm>
matches.erase(std::remove_if(matches.begin(), matches.end(), bad_dist), matches.end());
where bad_dist is defined as
bool bad_dist(const DMatch &m) {
    return m.distance > 150;   // reject matches whose descriptor distance is too large
}
In this code I get rid of 'bad' matches.

There are many ways to match two patterns in the same image; it is actually a very open topic in computer vision, because there isn't one globally best solution.
For instance, if you know your object can appear rotated (I'm not familiar with SURF, but I guess the descriptors are rotation invariant, like SIFT descriptors), you can estimate the rotation between the pattern you have in the training set and the pattern you just matched. A match with the minimum error will be a better match.
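As an illustration of that "minimum error" idea, here is a minimal sketch that scores a candidate match by the mean reprojection error of a RANSAC-estimated homography between the matched template and scene points. The function name, the RANSAC threshold and the choice of a homography (rather than a pure rotation) are my own assumptions, not something prescribed by the answer.

#include <opencv2/core/core.hpp>
#include <opencv2/calib3d/calib3d.hpp>
#include <cmath>
#include <limits>
#include <vector>

// Lower returned value = geometrically more consistent (better) match.
double matchError(const std::vector<cv::Point2f>& templatePts,
                  const std::vector<cv::Point2f>& scenePts)
{
    if (templatePts.size() < 4)
        return std::numeric_limits<double>::max();

    // Estimate a homography between template and scene points; RANSAC rejects outliers
    cv::Mat inlierMask;
    cv::Mat H = cv::findHomography(templatePts, scenePts, cv::RANSAC, 3.0, inlierMask);
    if (H.empty())
        return std::numeric_limits<double>::max();

    // Mean reprojection error over the inliers
    std::vector<cv::Point2f> projected;
    cv::perspectiveTransform(templatePts, projected, H);

    double err = 0.0;
    int inliers = 0;
    for (size_t i = 0; i < templatePts.size(); ++i)
    {
        if (!inlierMask.at<uchar>((int)i))
            continue;
        cv::Point2f d = scenePts[i] - projected[i];
        err += std::sqrt(d.x * d.x + d.y * d.y);
        ++inliers;
    }
    return inliers > 0 ? err / inliers : std::numeric_limits<double>::max();
}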
I recommend you consult Computer Vision: Algorithms and Applications. There's no code in it, but it covers lots of useful techniques typically used in computer vision (most of them already implemented in OpenCV).

Related

OpenCV edge based object detection C++

I have an application where I have to detect the presence of some items in a scene. The items can be rotated and slightly scaled (bigger or smaller). I've tried using keypoint detectors, but they're not fast and accurate enough. So I've decided to first detect edges in the template and in the search area, using Canny (or a faster edge detection algorithm), and then match the edges to find the position, orientation, and size of the match found.
All this needs to be done in less than a second.
I've tried using matchTemplate() and matchShapes(), but the former is NOT scale and rotation invariant, and the latter doesn't work well with the actual images. Rotating the template image in order to match is also time-consuming.
So far I have been able to detect the edges of the template but I don't know how to match them with the scene.
I've already gone through the following but wasn't able to get them to work (they either use an old version of OpenCV, or just don't work with images other than those in the demo):
https://www.codeproject.com/Articles/99457/Edge-Based-Template-Matching
Angle and Scale Invariant template matching using OpenCV
https://answers.opencv.org/question/69738/object-detection-kinect-depth-images/
Can someone please suggest an approach for this, or a code snippet if possible?
This is my sample input image (the parts to detect are marked in red).
These are some pieces of software that do this, and also how I want it to work:
This topic is what I have actually been dealing with for a year on a project, so I will try to explain my approach and how I am doing it. I assume that you have already done the preprocessing steps (filters, brightness, exposure, calibration, etc.), and be sure you clean the noise from the image.
Note: In my approach, I collect data from the contours of a reference image, which shows my desired object. Then I compare these data with the other contours in the big image.
Use Canny edge detection and find the contours on the reference image. You need to be sure that it does not miss parts of the contours; if it does, the preprocessing probably has some problems. The other important point is that you need to find an appropriate mode for findContours, because every mode has different properties, so pick the one that fits your case. At the end, keep only the contours that are okay for you.
After getting the contours from the reference, you can find the length of every contour using the output of findContours(). You can compare these values with the contours on your big image and eliminate the contours that are very different.
minAreaRect precisely draws a fitted, enclosing rectangle for each contour. In my case, this function is very good to use. I get two parameters using this function:
a) Calculate the short and long edge of the fitted rectangle and compare the values with the other contours on the big image.
b) Calculate the percentage of blackness or whiteness (if your image is grayscale, get a percentage of how many pixels are close to white or black) and compare at the end.
matchShapes can be applied at the end to the remaining contours, or you can apply it to all contours (I suggest the first approach). Each contour is just an array, so you can hold the reference contours in an array and compare them with the others at the end. Doing the first three steps and then applying matchShapes works very well on my side; a sketch combining these steps follows after this list.
I think matchTemplate is not good to use directly. I draw every contour onto a separate zeroed Mat (a blank black image) as a template image and then compare it with the others. Using the reference template image directly doesn't give good results.
OpenCV has some good algorithms for finding circles, convexity, etc. If your situation is related to those, you can also use them as a step.
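To make the steps above more concrete, here is a minimal OpenCV 2.x-style sketch combining the Canny -> findContours -> contour length -> minAreaRect -> matchShapes chain (the blackness/whiteness check is omitted because it needs the pixel data). All thresholds, tolerances and names are illustrative placeholders to be tuned for your own images; this is a sketch of the idea, not the author's actual code.

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <algorithm>
#include <cmath>
#include <vector>

// Step 1: Canny edges, then contours (choose the retrieval mode that fits your case)
std::vector<std::vector<cv::Point> > extractContours(const cv::Mat& gray)
{
    cv::Mat edges;
    cv::Canny(gray, edges, 50, 150);                     // tune thresholds for your images
    std::vector<std::vector<cv::Point> > contours;
    cv::findContours(edges, contours, CV_RETR_LIST, CV_CHAIN_APPROX_SIMPLE);
    return contours;
}

// Steps 2-4: compare one reference contour against a candidate contour from the big image
bool isSimilar(const std::vector<cv::Point>& reference,
               const std::vector<cv::Point>& candidate)
{
    // Step 2: reject candidates whose contour length differs too much
    double refLen  = cv::arcLength(reference, true);
    double candLen = cv::arcLength(candidate, true);
    if (std::abs(refLen - candLen) > 0.2 * refLen)
        return false;

    // Step 3: reject candidates whose fitted rectangle has a very different short edge
    cv::RotatedRect refRect  = cv::minAreaRect(reference);
    cv::RotatedRect candRect = cv::minAreaRect(candidate);
    double refShort  = std::min(refRect.size.width,  refRect.size.height);
    double candShort = std::min(candRect.size.width, candRect.size.height);
    if (std::abs(refShort - candShort) > 0.2 * refShort)
        return false;

    // Step 4: final check with matchShapes (lower score = more similar)
    double score = cv::matchShapes(reference, candidate, CV_CONTOURS_MATCH_I1, 0);
    return score < 0.1;
}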
At the end, you just collect all the data and values, and you can make a table in your mind. The rest is a kind of statistical analysis.
Note: I think the most important part is the preprocessing, so be sure you have a clean, almost noiseless image and reference.
Note: Training can be a good solution for your case if you just want to know whether the objects exist or not. But if you are trying to do something for an industrial application, this is totally the wrong way. I tried YOLO and Haar cascade training algorithms several times and also trained some objects with them. The experience I got is that they can find objects almost correctly, but the center coordinates, rotation results, etc. will not be totally correct even if your calibration is correct. On the other hand, training time and collecting data are painful.
You have rather bad image quality and very bad light conditions, so you have only two ways:
1. Use filters -> binary threshold -> findContours -> matchShapes. But this is a very unstable algorithm for your object type and image quality; you will get a lot of wrong contours and it is hard to filter them.
2. Haar cascades -> cut out the bounding box -> check the shape inside.
All "special points / edge matching" algorithms will not work in such bad conditions.

Averaging SIFT features to do pose estimation

I have created a point cloud of an irregular (non-planar) complex object using SfM. Each one of those 3D points was viewed in more than one image, so it has multiple (SIFT) features associated with it.
Now, I want to solve for the pose of this object in a new, different set of images using a PnP algorithm matching the features detected in the new images with the features associated with the 3D points in the point cloud.
So my question is: which descriptor do I associate with the 3D point to get the best results?
So far I've come up with a number of possible solutions...
Average all of the descriptors associated with the 3D point (taken from the SfM pipeline) and use that "mean descriptor" to do the matching in PnP. This approach seems a bit far-fetched to me - I don't know enough about feature descriptors (specifically SIFT) to comment on the merits and downfalls of this approach.
"Pin" all of the descriptors calculated during the SfM pipeline to their associated 3D point. During PnP, you would essentially have duplicate points to match with (one duplicate for each descriptor). This is obviously intensive.
Find the "central" viewpoint that the feature appears in (from the SfM pipeline) and use the descriptor from this view for PnP matching. So if the feature appears in images taken at -30, 10, and 40 degrees ( from surface normal), use the descriptor from the 10 degree image. This, to me, seems like the most promising solution.
Is there a standard way of doing this? I haven't been able to find any research or advice online regarding this question, so I'm really just curious if there is a best solution, or if it is dependent on the object/situation.
The descriptors that are used for matching in most SLAM or SfM systems are rotation and scale invariant (and to some extent, robust to intensity changes). That is why we are able to match them from different view points in the first place. So, in general it doesn't make much sense to try to use them all, average them, or use the ones from a particular image. If the matching in your SfM was done correctly, the descriptors of the reprojection of a 3d point from your point cloud in any of its observations should be very close, so you can use any of them [1].
Also, it seems to me that you are trying to directly match the 2d points to the 3d points. From a computational point of view, I think this is not a very good idea, because by matching 2d points with 3d ones, you lose the spatial information of the images and have to search for matches in a brute force manner. This in turn can introduce noise. But, if you do your matching from image to image and then propagate the results to the 3d points, you will be able to enforce priors (if you roughly know where you are, i.e. from an IMU, or if you know that your images are close), you can determine the neighborhood where you look for matches in your images, etc. Additionally, once you have computed your pose and refined it, you will need to add more points, no? How will you do it if you haven't done any 2d/2d matching, but just 2d/3d matching?
Now, the way to implement that usually depends on your application (how much covisibility or baseline you have between the poses from your SfM, etc.). As an example, let's denote your candidate image I_0 and the images from your SfM I_1, ..., I_n. First, match between I_0 and I_1. Now, assume q_0 is a 2d point from I_0 that has been successfully matched to q_1 from I_1, which corresponds to some 3d point Q. To ensure consistency, consider the reprojection of Q in I_2, and call it q_2. Match I_0 and I_2. Does the point to which q_0 is matched in I_2 fall close to q_2? If yes, keep the 2d/3d match between q_0 and Q, and so on.
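A minimal sketch of that consistency check, assuming the pose (rvec, tvec) and the intrinsics of I_2 are available from the SfM reconstruction; the function, its parameters and the pixel threshold are illustrative assumptions, not a prescribed API.

#include <opencv2/core/core.hpp>
#include <opencv2/calib3d/calib3d.hpp>
#include <cmath>
#include <vector>

// Returns true if the 2d point that q_0 matches to in I_2 lies close to the
// reprojection q_2 of the 3d point Q in I_2.
bool isConsistent(const cv::Point3f& Q,              // 3d point obtained through I_1
                  const cv::Point2f& matchInI2,      // where q_0 matches in I_2
                  const cv::Mat& rvec2, const cv::Mat& tvec2,
                  const cv::Mat& K, const cv::Mat& distCoeffs,
                  double maxPixelError = 3.0)
{
    std::vector<cv::Point3f> objectPoints(1, Q);
    std::vector<cv::Point2f> imagePoints;
    cv::projectPoints(objectPoints, rvec2, tvec2, K, distCoeffs, imagePoints);

    cv::Point2f d = imagePoints[0] - matchInI2;      // distance between q_2 and the match
    return std::sqrt(d.x * d.x + d.y * d.y) <= maxPixelError;
}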
I don't have enough information about your data and your application, but I think that depending on your constraints (real-time or not, etc), you could come up with some variation of the above. The key idea anyway is, as I said previously, to try to match from frame to frame and then propagate to the 3d case.
Edit: Thank you for your clarifications in the comments. Here are a few thoughts (feel free to correct me):
Let's consider a SIFT descriptor s_0 from I_0, and let's denote by F(s_1,...,s_n) your aggregated descriptor (which can be an average or a concatenation of the SIFT descriptors s_i in their corresponding I_i, etc.). Then, when matching s_0 with F, you will only want to use a subset of the s_i that belong to images with viewpoints close to I_0 (because of the 30deg problem that you mention, although I think it should be 50deg). That means you have to attribute a weight to each s_i that depends on the pose of your query I_0. You obviously can't do that when constructing F, so you have to do it when matching. However, you don't have a strong prior on the pose (otherwise, I assume you wouldn't be needing PnP). As a result, you can't really determine this weight. Therefore I think there are two conclusions/options here:
SIFT descriptors are not adapted to the task. You can try coming up with a perspective-invariant descriptor. There is some literature on the subject.
Try to keep some visual information in the form of "key-frames", as in many SLAM systems. It wouldn't make sense to keep all of your images anyway; just keep a few that are well distributed (pose-wise) in each area, and use those to propagate 2d matches to the 3d case.
If you only match between the 2d point of your query and 3d descriptors without any form of consistency check (as the one I proposed earlier), you will introduce a lot of noise...
tl;dr I would keep some images.
[1] Since you say that you obtain your 3d reconstruction from an SfM pipeline, some of them are probably considered inliers and some are outliers (indicated by a boolean flag). If they are outliers, just ignore them; if they are inliers, then they are the result of matching and triangulation, and their position has been refined multiple times, so you can trust any of their descriptors.

C++ - Using Bag of Words for matching pictures together?

I would like to compare a picture (with its descriptors) with thousands of pictures inside a database in order to do matching (if two pictures show the same thing, even if one is rotated, a bit blurred, at a different scale, etc.).
For example:
I saw on Stack Overflow that computing descriptors for each picture and comparing them one to one is a very long process.
I did some research and I saw that I can use an algorithm based on Bag of Words.
I don't know exactly how it works yet, but it seems to be good. However (I may be mistaken), isn't it only for detecting what kind of object something is?
I would like to know whether, in your opinion, using it can be a good solution to compare a picture to thousands of pictures using descriptors like SIFT or SURF.
If yes, do you have some suggestions about how I can do that?
Thanks,
Yes, it is possible. The only thing you have to pay attention to is the computational requirement, which can be a little overwhelming. If you can narrow the search, that usually helps.
To support my answer I will extract some examples from a recent work of ours. We aimed at recognizing a painting on a museum's wall using SIFT + RANSAC matching. We have a database of all the paintings in the museum and a SIFT descriptor for each one of them. We aim at recognizing the painting in a video which can be recorded from a different perspective (all the templates are frontal) or under different lighting conditions. This image should give you an idea: on the left you can see the template and the current frame. The second image is the SIFT matching and the third shows the results after RANSAC.
Once you have the matching between your image and each SIFT descriptor set in your database, you can compute a score, namely the ratio between matched points (after RANSAC) and the total number of keypoints. This can be repeated for each image, and the image with the best score (the highest inlier ratio, i.e. the most geometrically consistent match) can be declared the match.
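A minimal sketch of that scoring, assuming SIFT keypoints and descriptors have already been extracted for the current frame and for one database template; the names, the cross-check matching and the RANSAC threshold are my assumptions rather than the authors' exact setup.

#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/calib3d/calib3d.hpp>
#include <vector>

double inlierRatio(const std::vector<cv::KeyPoint>& templateKp, const cv::Mat& templateDesc,
                   const std::vector<cv::KeyPoint>& frameKp,    const cv::Mat& frameDesc)
{
    // Match template descriptors to the current frame
    cv::BFMatcher matcher(cv::NORM_L2, true);          // cross-check enabled
    std::vector<cv::DMatch> matches;
    matcher.match(templateDesc, frameDesc, matches);
    if (matches.size() < 4)
        return 0.0;

    // Collect the matched coordinates and let RANSAC reject outliers
    std::vector<cv::Point2f> src, dst;
    for (size_t i = 0; i < matches.size(); ++i)
    {
        src.push_back(templateKp[matches[i].queryIdx].pt);
        dst.push_back(frameKp[matches[i].trainIdx].pt);
    }
    cv::Mat inlierMask;
    cv::findHomography(src, dst, cv::RANSAC, 3.0, inlierMask);

    // Score = inliers after RANSAC / total number of template keypoints
    int inliers = cv::countNonZero(inlierMask);
    return (double)inliers / (double)templateKp.size();
}

Repeating this for every database image and keeping the one with the highest ratio reproduces the selection step described above.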
We used this for paintings but I think it can be generalized to every kind of image (the Android logo you posted in the question is a fair example, I think).
Hope this helps!

How to remove false matches from FLANNBASED Matcher in OPENCV?

[Request you to read question details before marking it duplicate or down-voting it. I have searched thoroughly and couldn't find a solution and hence posting the question here.]
I am trying to compare one image with multiple images and get a list of ALL matching images. I do NOT want to draw keypoints between images.
My solution is based on the following source code:
https://github.com/Itseez/opencv/blob/master/samples/cpp/matching_to_many_images.cpp
The above source code matches one image with multiple images and get best matching image.
I have modified the above sample and generated:
vector<vector<DMatch>> matches;
vector<vector<DMatch>> good_matches;
Now my question is how do I apply nearest neighbor search ratio to get good matches for multiple images?
Edit 1:
My implementation is as follows:
For each image in the data-set, compute SURF descriptors.
Combine all the descriptors into one big matrix.
Build a FLANN index from the concatenated matrix.
Compute descriptors for the query image.
Run a KNN search over the FLANN index to find the top 20 (or fewer) best matching images. K is set to 20.
Filter out all the inadequate matches computed in the previous step. (How??)
I have successfully done steps number 1 to 5. I am facing problem in step number 6 wherein I am not able to remove false matches.
There are two answers to your problem. The first is that you should be using a completely different technique, the second answer is how to actually do what you asked for.
Use a different method
You want to find duplicates of a given query image. Traditionally, you do this by comparing global image descriptors, not local feature descriptors.
The simplest way to do this would be by aggregating the local feature descriptors into a single global descriptor. The standard method here is the "bag of visual words". In OpenCV this is called Bag-Of-Words (see BOWTrainer, BOWImgDescriptorExtractor, etc.). Have a look at the documentation for using this.
There is some example code in samples/cpp/bagofwords_classification.cpp
The benefits will be that you get more robust results (depending on the implementation of what you are doing now), and that the matching is generally faster.
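For reference, here is a minimal sketch of the OpenCV 2.x Bag-Of-Words pipeline (vocabulary training plus a per-image histogram), assuming the nonfree SURF module is available; the vocabulary size, detector parameters and all names are illustrative.

#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/nonfree/features2d.hpp>
#include <vector>

cv::Mat globalDescriptor(const std::vector<cv::Mat>& trainingImages, const cv::Mat& queryImage)
{
    cv::Ptr<cv::FeatureDetector>     detector  = new cv::SurfFeatureDetector(400);
    cv::Ptr<cv::DescriptorExtractor> extractor = new cv::SurfDescriptorExtractor();
    cv::Ptr<cv::DescriptorMatcher>   matcher   = new cv::FlannBasedMatcher();

    // 1) Cluster all training descriptors into a visual vocabulary
    //    (you need many more descriptors than visual words)
    cv::BOWKMeansTrainer bowTrainer(1000);                 // 1000 visual words
    for (size_t i = 0; i < trainingImages.size(); ++i)
    {
        std::vector<cv::KeyPoint> kp;
        cv::Mat desc;
        detector->detect(trainingImages[i], kp);
        extractor->compute(trainingImages[i], kp, desc);
        if (!desc.empty())
            bowTrainer.add(desc);
    }
    cv::Mat vocabulary = bowTrainer.cluster();

    // 2) Describe the query image as a histogram of visual words
    cv::BOWImgDescriptorExtractor bowExtractor(extractor, matcher);
    bowExtractor.setVocabulary(vocabulary);

    std::vector<cv::KeyPoint> kp;
    cv::Mat bowDesc;
    detector->detect(queryImage, kp);
    bowExtractor.compute(queryImage, kp, bowDesc);
    return bowDesc;     // one row; compare to other images' histograms with e.g. the L2 distance
}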
Use your method
I understand that you want to remove points from the input that lead to false positives in your matching.
You can't remove points from FLANN(1, 2, 3). FLANN builds a tree for fast search. Depending on the type of tree, removing a node becomes impossible. Guess what, FLANN uses a KD-tree which doesn't (easily) allow removal of points.
FlannBasedMatcher does not support masking permissible matches of descriptor sets because flann::Index does not support this.
I would suggest using a radius search instead of a plain search. Alternatively, look at the L2 distance of the found matches and write a function in your code to check whether the distance falls below a threshold.
Edit
I should also note that you can rebuild your flann-tree. Obviously, there is a performance penalty when doing this. But if you have a large number of queries and some features coming up as false-positives way too often, it might make sense to do this once.
You need the functions DescriptorMatcher::clear() and then DescriptorMatcher::add(const vector<Mat>& descriptors) for this. Reference.
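A minimal sketch of that rebuild, assuming the per-image descriptor matrices are kept around and a flag says which images to drop; the names are illustrative.

#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <vector>

void rebuildMatcher(cv::FlannBasedMatcher& matcher,
                    const std::vector<cv::Mat>& allDescriptors,   // one Mat per database image
                    const std::vector<bool>& keep)                // false = drop this image
{
    std::vector<cv::Mat> filtered;
    for (size_t i = 0; i < allDescriptors.size(); ++i)
        if (keep[i])
            filtered.push_back(allDescriptors[i]);

    matcher.clear();              // drop the old descriptor sets and the old index
    matcher.add(filtered);        // add the remaining descriptor sets
    matcher.train();              // rebuild the FLANN index (this is the performance penalty)
}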
First, you need to define "inadequate matches"; without this definition you can't do anything.
There are several loose definitions that spring to mind:
1: inadequate matches = match with distance > pre-defined distance
In this case a FLANN radius search may be more appropriate, as it will only give you the indices within the predefined radius from the target:
http://docs.opencv.org/trunk/modules/flann/doc/flann_fast_approximate_nearest_neighbor_search.html#flann-index-t-radiussearch
2: inadequate matches = match with distance > dynamically defined distance based on the retrieved k-nn
This is more tricky, and off the top of my head I can think of two possible solutions:
2a: Define a ratio test based on the distance to the 1-NN, such as:
base distance = distance to the 1-NN
inadequate match_k = match distance_k >= a * base distance (a C++ sketch of this test is given below, after option 2b)
2b: Use a dynamic thresholding technique, such as the Otsu threshold on the normalized distribution of distances for the k-NN, thus partitioning the k-NN into two groups: the group that contains the 1-NN is the adequate group, the other is the inadequate group.
http://en.wikipedia.org/wiki/Otsu's_method,
http://docs.opencv.org/modules/imgproc/doc/miscellaneous_transformations.html#threshold.
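For option 2a, here is a minimal sketch of the ratio test applied to the output of knnMatch (one vector of k matches per query descriptor, sorted by distance); the factor a is an illustrative value you would tune.

#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <vector>

void filterByRatioTo1NN(const std::vector<std::vector<cv::DMatch> >& knnMatches,
                        std::vector<std::vector<cv::DMatch> >& goodMatches,
                        float a = 1.5f)
{
    goodMatches.clear();
    goodMatches.resize(knnMatches.size());
    for (size_t i = 0; i < knnMatches.size(); ++i)
    {
        if (knnMatches[i].empty())
            continue;
        float baseDistance = knnMatches[i][0].distance;       // distance to the 1-NN
        for (size_t k = 0; k < knnMatches[i].size(); ++k)
            if (knnMatches[i][k].distance < a * baseDistance)  // keep only the adequate matches
                goodMatches[i].push_back(knnMatches[i][k]);
    }
}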

OpenCV detect image against a image set

I would like to know how I can use OpenCV to detect an image with my video camera. The image can be one of 500 images.
What I'm doing at the moment:
- (void)viewDidLoad
{
    [super viewDidLoad];
    // Do any additional setup after loading the view.
    self.videoCamera = [[CvVideoCamera alloc] initWithParentView:imageView];
    self.videoCamera.delegate = self;
    self.videoCamera.defaultAVCaptureDevicePosition = AVCaptureDevicePositionBack;
    self.videoCamera.defaultAVCaptureSessionPreset = AVCaptureSessionPresetHigh;
    self.videoCamera.defaultAVCaptureVideoOrientation = AVCaptureVideoOrientationPortrait;
    self.videoCamera.defaultFPS = 30;
    self.videoCamera.grayscaleMode = NO;
}

- (void)viewDidAppear:(BOOL)animated
{
    [super viewDidAppear:animated];
    [self.videoCamera start];
}

#pragma mark - Protocol CvVideoCameraDelegate

#ifdef __cplusplus
- (void)processImage:(cv::Mat&)image
{
    // Do some OpenCV stuff with the image
    cv::Mat image_copy;
    cvtColor(image, image_copy, CV_BGRA2BGR);

    // invert image
    //bitwise_not(image_copy, image_copy);
    //cvtColor(image_copy, image, CV_BGR2BGRA);
}
#endif
The images that I would like to detect are small (2-5 KB). A few have text on them, but others are just signs. Here is an example:
Do you guys know how I can do that?
There are several things in here. I will break down your problem and point you towards some possible solutions.
Classification: Your main task consists of determining whether a certain image belongs to a class. This problem by itself can be decomposed into several subproblems:
Feature Representation: You need to decide how you are going to model your features, i.e. how you are going to represent each image in a feature space so you can train a classifier to separate the classes. The feature representation by itself is already a big design decision. One could (i) calculate the histogram of the images using n bins and train a classifier, or (ii) choose a sequence of random patch comparisons, as in a random forest. However, after the training, you need to evaluate the performance of your algorithm to see how good your decision was.
There is a well-known problem called overfitting, which is when the classifier learns the training data so well that it cannot generalize. This can usually be avoided with cross-validation. If you are not familiar with the concepts of false positives or false negatives, take a look at this article.
Once you define your feature space, you need to choose an algorithm to train on that data, and this might be your biggest decision. There are several algorithms coming out every day. To name a few of the classical ones: Naive Bayes, SVM, Random Forests; more recently the community has obtained great results using deep learning. Each of those has its own specific usage (e.g. SVMs are great for binary classification) and you need to be familiar with the problem. You can start with simple assumptions, such as independence between random variables, and train a Naive Bayes classifier to try to separate your images.
Patches: Now, you mentioned that you would like to recognize the images on your webcam. If you are going to print the images and display them in a video, you need to handle several things. It is necessary to define patches on your big image (the input from the webcam), build a feature representation for each patch, and classify it in the same way you did in the previous step. To do that, you could slide a window and classify all the patches to see whether they belong to the negative class or to one of the positive ones. There are other alternatives.
Scale: Considering that you are able to detect the location of images in the big image and classify them, the next step is to relax the toy assumption of fixed scale. To handle a multiscale approach, you could use an image pyramid, which pretty much allows you to perform the detection in multiresolution. Alternative approaches could use keypoint detectors, such as SIFT and SURF; inside SIFT there is an image pyramid, which provides the scale invariance.
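A minimal sketch of such a multiscale, sliding-window search over an image pyramid; classifyPatch is a hypothetical stand-in for whatever classifier you train in the previous step, and the window size, stride and number of levels are illustrative.

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

// Placeholder for the trained classifier described above (hypothetical)
static bool classifyPatch(const cv::Mat& patch)
{
    (void)patch;
    return false;   // plug in your SVM / Naive Bayes / forest decision here
}

void detectAcrossScales(const cv::Mat& frame, int patchSize, int levels,
                        std::vector<cv::Rect>& detections)
{
    cv::Mat level = frame.clone();
    double scale = 1.0;
    for (int l = 0; l < levels && level.cols > patchSize && level.rows > patchSize; ++l)
    {
        // slide a fixed-size window over the current pyramid level
        for (int y = 0; y + patchSize <= level.rows; y += patchSize / 2)
            for (int x = 0; x + patchSize <= level.cols; x += patchSize / 2)
            {
                cv::Mat patch = level(cv::Rect(x, y, patchSize, patchSize));
                if (classifyPatch(patch))
                    detections.push_back(cv::Rect(cvRound(x * scale), cvRound(y * scale),
                                                  cvRound(patchSize * scale),
                                                  cvRound(patchSize * scale)));
            }
        cv::Mat next;
        cv::pyrDown(level, next);    // next (coarser) pyramid level
        level = next;
        scale *= 2.0;                // map detections back to the original resolution
    }
}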
Projection: So far we assumed that you have images under orthographic projection, but most likely you will have slight perspective projections, which will make the whole previous assumption fail. One naive solution would be, for instance, to detect the corners of the white background of your image and rectify the image before building the feature vector for classification. If you use SIFT or SURF, you could design a way of avoiding explicitly handling that. Nevertheless, if your input is going to be just square patches, such as in ARToolKit, I would go for manual rectification.
I hope I might have given you a better picture of your problem.
I would recommend using SURF for that, because the pictures can be at different distances from your camera, i.e. at a changing scale. I had one similar experiment and SURF worked just as expected. But SURF requires very careful tuning (and expensive operations); you should try different setups before you get the needed results.
Here is a link: http://docs.opencv.org/modules/nonfree/doc/feature_detection.html
youtube video (in C#, but can give an idea): http://www.youtube.com/watch?v=zjxWpKCQqJc
I might not be qualified enough to answer this problem; the last time I seriously used OpenCV it was still 1.1. But here are some thoughts on it, which I hope will help (currently I am interested in DIP and ML).
I think it will probably be an easier task if you only need to classify an image, when the image is just one of (or very similar to) your 500 images. For this you could use an SVM or some neural network (Felix already gave an excellent enumeration of those).
However, your problem seems to be that you first need to find this candidate image in your webcam frame, and you have little clue about its location beforehand. (Let us know whether this is so; I think it is important.)
If so, the harder problem is the detection/localization of your candidate image.
I don't have a general solution for that. The first thing I would do is to see if there is some common feature in your 500 images (e.g., whether all of them are enclosed by a red circle, or half of them contain a circle and half contain a rectangle). If this can be done, the problem will be simpler (it would be similar to the face detection problem, which has good solutions).
In other words, this means that you first classify the 500 images into a few groups with a common feature (by hand), detect the group first, then scale and use the above-mentioned techniques to classify the image into a fine result. This way, it will be more computationally acceptable than trying to detect 500 images one by one.
BTW, this ppt can help give a visual clue of what is going on in feature extraction and image matching: http://courses.cs.washington.edu/courses/cse455/09wi/Lects/lect6.pdf
Detect vs recognize: detecting the image is just finding it on the background, and from your comments I realized you may have your signs surrounded by the background. It might facilitate your algorithm if you can somehow crop your signs from the background (detection) before trying to recognize them. Recognition is the next stage, which presumes you can classify the cropped image correctly as one seen before.
If you need real-time speed and scale/rotation invariance, neither SIFT nor SURF will do this fast. Nowadays you can do much better if you shift the burden of image processing to a learning stage, as was done by Lepetit. In short, he subjected each pattern to a bunch of affine transformations and trained a binary classification tree to recognize each point correctly by doing a lot of binary comparison tests. Trees are extremely fast and the way to go, not to mention that most of the processing is done offline. This method is also more robust to off-plane rotations than SIFT or SURF. You will also learn about tree classification, which may help with your last processing stage.
Finally, the recognition stage is based not only on the number of matches but also on their geometric consistency. Since your signs look flat, I suggest finding either an affine or a homography transformation that has the most inliers when calculated between matched points.
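A minimal sketch of that geometric-consistency check using a RANSAC-estimated homography, which also projects the template corners into the frame to localize the recognized sign; the minimum inlier count and all names are illustrative assumptions.

#include <opencv2/core/core.hpp>
#include <opencv2/calib3d/calib3d.hpp>
#include <vector>

bool geometricallyConsistent(const std::vector<cv::Point2f>& signPts,   // matched points in the sign template
                             const std::vector<cv::Point2f>& framePts,  // corresponding points in the frame
                             const cv::Size& signSize,
                             std::vector<cv::Point2f>& corners,         // output: sign corners in the frame
                             int minInliers = 10)
{
    if (signPts.size() < 4)
        return false;

    // Estimate the homography and count how many matches agree with it
    cv::Mat mask;
    cv::Mat H = cv::findHomography(signPts, framePts, cv::RANSAC, 3.0, mask);
    if (H.empty() || cv::countNonZero(mask) < minInliers)
        return false;

    // Project the template corners into the frame to localize the recognized sign
    std::vector<cv::Point2f> templateCorners(4);
    templateCorners[0] = cv::Point2f(0.0f, 0.0f);
    templateCorners[1] = cv::Point2f((float)signSize.width, 0.0f);
    templateCorners[2] = cv::Point2f((float)signSize.width, (float)signSize.height);
    templateCorners[3] = cv::Point2f(0.0f, (float)signSize.height);
    cv::perspectiveTransform(templateCorners, corners, H);
    return true;
}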
Looking at your code, though, I realized that you may not be following any of these recommendations. It may be a good starting point for you to read about decision trees and then play with some sample code (see mushroom.cpp in the above-mentioned link).