How to remove false matches from FlannBasedMatcher in OpenCV? - C++

[Please read the question details before marking it as a duplicate or down-voting it. I have searched thoroughly and could not find a solution, hence I am posting the question here.]
I am trying to compare one image with multiple images and get a list of ALL matching images. I do NOT want to draw keypoints between images.
My solution is based on the following source code:
https://github.com/Itseez/opencv/blob/master/samples/cpp/matching_to_many_images.cpp
The above source code matches one image against multiple images and returns the best matching image.
I have modified the above sample and generated:
vector<vector<DMatch>> matches;
vector<vector<DMatch>> good_matches;
Now my question is how do I apply nearest neighbor search ratio to get good matches for multiple images?
Edit 1:
My implementation is as follows:
For each image in the data-set, compute SURF descriptors.
Combine all the descriptors into one big matrix.
Build a FLANN index from the concatenated matrix.
Compute descriptors for the query image.
Run a KNN search over the FLANN index to find the top 20 (or fewer) best matching images. K is set to 20.
Filter out all the inadequate matches computed in the previous step. (How??)
I have successfully done steps 1 to 5. I am facing a problem in step 6, where I am not able to remove the false matches.
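For reference, my steps 1 to 5 look roughly like the following sketch (OpenCV 3.x with the xfeatures2d contrib module is assumed for SURF; the file names and variable names are placeholders rather than my actual code):
#include <opencv2/core.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/xfeatures2d.hpp>   // SURF lives here in OpenCV 3.x (opencv_contrib)
#include <vector>

using namespace cv;

int main()
{
    // 1-2: compute SURF descriptors for every image in the data set
    Ptr<Feature2D> surf = xfeatures2d::SURF::create(400);
    std::vector<Mat> trainDescriptors;
    std::vector<String> trainFiles = { "img1.jpg", "img2.jpg", "img3.jpg" };  // placeholder paths
    for (const String& file : trainFiles)
    {
        Mat img = imread(file, IMREAD_GRAYSCALE);
        std::vector<KeyPoint> kp;
        Mat desc;
        surf->detectAndCompute(img, noArray(), kp, desc);
        trainDescriptors.push_back(desc);
    }

    // 3: build the FLANN index over the combined training descriptors
    FlannBasedMatcher matcher;
    matcher.add(trainDescriptors);
    matcher.train();                       // concatenates the descriptors and builds the index

    // 4: descriptors of the query image
    Mat query = imread("query.jpg", IMREAD_GRAYSCALE);
    std::vector<KeyPoint> queryKp;
    Mat queryDesc;
    surf->detectAndCompute(query, noArray(), queryKp, queryDesc);

    // 5: k-NN search; each DMatch carries imgIdx telling which training image it came from
    std::vector<std::vector<DMatch>> matches;
    matcher.knnMatch(queryDesc, matches, 20);

    // 6: filtering out the inadequate matches is what the question is about
    return 0;
}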

There are two answers to your problem. The first is that you should be using a completely different technique, the second answer is how to actually do what you asked for.
Use a different method
You want to find duplicates of a given query image. Traditionally, you do this by comparing global image descriptors, not local feature descriptors.
The simplest way to do this would be to aggregate the local feature descriptors into a global image descriptor. The standard method here is "bag of visual words". In OpenCV this is called Bag-Of-Words (see BOWTrainer, BOWImgDescriptorExtractor etc.). Have a look at the documentation for using this.
There is some example code in samples/cpp/bagofwords_classification.cpp
The benefits will be that you get more robust results (depending on the implementation of what you are doing now), and that the matching is generally faster.
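To give an idea of the pipeline, here is a minimal sketch of vocabulary training and per-image histogram computation with OpenCV's BoW classes. It assumes SURF via xfeatures2d, and the training paths and vocabulary size are invented placeholders; see the sample mentioned above for a complete implementation.
#include <opencv2/core.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <vector>

using namespace cv;

int main()
{
    std::vector<String> trainingImages = { "a.jpg", "b.jpg" };   // placeholder paths
    int vocabSize = 1000;                                        // number of visual words

    Ptr<Feature2D> surf = xfeatures2d::SURF::create(400);

    // 1) Collect local descriptors from the whole data set
    BOWKMeansTrainer bowTrainer(vocabSize);
    for (const String& file : trainingImages)
    {
        Mat img = imread(file, IMREAD_GRAYSCALE);
        std::vector<KeyPoint> kp;
        Mat desc;
        surf->detectAndCompute(img, noArray(), kp, desc);
        if (!desc.empty()) bowTrainer.add(desc);
    }

    // 2) k-means clustering of the descriptors gives the vocabulary (one row per visual word)
    Mat vocabulary = bowTrainer.cluster();

    // 3) Per-image global descriptor: histogram of visual-word occurrences
    Ptr<DescriptorMatcher> flann = DescriptorMatcher::create("FlannBased");
    BOWImgDescriptorExtractor bowExtractor(surf, flann);
    bowExtractor.setVocabulary(vocabulary);

    Mat img = imread(trainingImages[0], IMREAD_GRAYSCALE);
    std::vector<KeyPoint> kp;
    surf->detect(img, kp);
    Mat bowDescriptor;                 // 1 x vocabSize histogram
    bowExtractor.compute(img, kp, bowDescriptor);

    // Comparing two images now reduces to comparing their BoW histograms,
    // e.g. with cv::norm(bow1, bow2, NORM_L2) or a chi-square distance.
    return 0;
}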
Use your method
I understand that you want to remove points from the input that lead to false positives in your matching.
You can't remove points from FLANN(1, 2, 3). FLANN builds a tree for fast search. Depending on the type of tree, removing a node becomes impossible. Guess what, FLANN uses a KD-tree which doesn't (easily) allow removal of points.
FlannBasedMatcher does not support masking permissible matches of descriptor sets because flann::Index does not support this.
I would suggest using a radius search instead of a plain search. Alternatively, look at the L2 distance of the found matches and write a function that keeps only the matches whose distance falls below a threshold.
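As an illustration of the second suggestion, a minimal sketch could be the following (matches is the output of knnMatch; the threshold value is a made-up placeholder that you would have to tune for your descriptors):
// matches is the vector<vector<DMatch>> produced by knnMatch(queryDesc, matches, 20)
// maxDistance is a hypothetical, data-dependent threshold on the SURF descriptor distance
const float maxDistance = 0.25f;
std::vector<std::vector<cv::DMatch>> good_matches(matches.size());

for (size_t i = 0; i < matches.size(); ++i)
    for (const cv::DMatch& m : matches[i])
        if (m.distance < maxDistance)          // keep only close neighbours
            good_matches[i].push_back(m);
Alternatively, DescriptorMatcher::radiusMatch performs the radius search directly and only returns neighbours within a given maxDistance.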
Edit
I should also note that you can rebuild your flann-tree. Obviously, there is a performance penalty when doing this. But if you have a large number of queries and some features coming up as false-positives way too often, it might make sense to do this once.
You need the functions DescriptorMatcher::clear() and then DescriptorMatcher::add(const vector<Mat>& descriptors) for this. Reference.
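Assuming you have already collected the descriptors you want to keep in filteredDescriptors (a hypothetical name), the rebuild itself is just:
// filteredDescriptors: one cv::Mat per training image, with the offending rows removed
std::vector<cv::Mat> filteredDescriptors;
// ... fill filteredDescriptors from your data set ...

matcher.clear();                      // DescriptorMatcher::clear() drops the old training set
matcher.add(filteredDescriptors);     // DescriptorMatcher::add(const std::vector<cv::Mat>&)
matcher.train();                      // rebuilds the FLANN index from scratch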

Firstly you need to define "inadequate matches", without this definition you can't do anything.
There are several loose definitions that spring to mind:
1: inadequate matches = match with distance > pre-defined distance
In this case a FLANN radius search may be more appropriate, as it will only give you those indices within the predefined radius from the target:
http://docs.opencv.org/trunk/modules/flann/doc/flann_fast_approximate_nearest_neighbor_search.html#flann-index-t-radiussearch
2: inadequate matches = match with distance > dynamically defined distance based on the retrieved k-nn
This is more tricky and off the top of my head i can think of two possible solutions:
2a: Define some ratio test based on the distance to the 1-NN, such as the following (a code sketch follows after this list):
base distance = distance to 1NN
inadequate match_k = match distance_k >= a * base distance;
2b: Use a dynamic thresholding technique such as the Otsu threshold on the normalized distribution of distances for the k-NN, thus partitioning the k-NN into two groups: the group that contains the 1-NN is the adequate group, the other is the inadequate group.
http://en.wikipedia.org/wiki/Otsu's_method,
http://docs.opencv.org/modules/imgproc/doc/miscellaneous_transformations.html#threshold.
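A possible sketch of option 2a, assuming matches is the vector<vector<DMatch>> returned by knnMatch (OpenCV returns each inner vector sorted by increasing distance) and a is a factor you tune yourself:
// Keep match k only if its distance is below a * (distance of the 1-NN).
// 'a' is a hypothetical factor, e.g. somewhere between 1.5 and 3 depending on your data.
const float a = 2.0f;
std::vector<std::vector<cv::DMatch>> good_matches(matches.size());

for (size_t i = 0; i < matches.size(); ++i)
{
    if (matches[i].empty())
        continue;
    const float baseDistance = matches[i][0].distance;   // distance to the 1-NN
    for (const cv::DMatch& m : matches[i])
        if (m.distance < a * baseDistance)               // adequate match by definition 2a
            good_matches[i].push_back(m);
}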

Related

Averaging SIFT features to do pose estimation

I have created a point cloud of an irregular (non-planar) complex object using SfM. Each one of those 3D points was viewed in more than one image, so it has multiple (SIFT) features associated with it.
Now, I want to solve for the pose of this object in a new, different set of images using a PnP algorithm matching the features detected in the new images with the features associated with the 3D points in the point cloud.
So my question is: which descriptor do I associate with the 3D point to get the best results?
So far I've come up with a number of possible solutions...
Average all of the descriptors associated with the 3D point (taken from the SfM pipeline) and use that "mean descriptor" to do the matching in PnP. This approach seems a bit far-fetched to me - I don't know enough about feature descriptors (specifically SIFT) to comment on the merits and downfalls of this approach.
"Pin" all of the descriptors calculated during the SfM pipeline to their associated 3D point. During PnP, you would essentially have duplicate points to match with (one duplicate for each descriptor). This is obviously intensive.
Find the "central" viewpoint that the feature appears in (from the SfM pipeline) and use the descriptor from this view for PnP matching. So if the feature appears in images taken at -30, 10, and 40 degrees ( from surface normal), use the descriptor from the 10 degree image. This, to me, seems like the most promising solution.
Is there a standard way of doing this? I haven't been able to find any research or advice online regarding this question, so I'm really just curious if there is a best solution, or if it is dependent on the object/situation.
The descriptors that are used for matching in most SLAM or SFM systems are rotation and scale invariant (and to some extent, robust to intensity changes). That is why we are able to match them from different view points in the first place. So, in general it doesn't make much sense to try to use them all, average them, or use the ones from a particular image. If the matching in your SFM was done correctly, the descriptors of the reprojection of a 3d point from your point cloud in any of its observations should be very close, so you can use any of them 1.
Also, it seems to me that you are trying to directly match the 2d points to the 3d points. From a computational point of view, I think this is not a very good idea, because by matching 2d points with 3d ones, you lose the spatial information of the images and have to search for matches in a brute force manner. This in turn can introduce noise. But, if you do your matching from image to image and then propagate the results to the 3d points, you will be able to enforce priors (if you roughly know where you are, i.e. from an IMU, or if you know that your images are close), you can determine the neighborhood where you look for matches in your images, etc. Additionally, once you have computed your pose and refined it, you will need to add more points, no? How will you do it if you haven't done any 2d/2d matching, but just 2d/3d matching?
Now, the way to implement that usually depends on your application (how much covisibility or baseline you have between the poses from your SFM, etc). As an example, let's denote your candidate image I_0, and the images from your SFM I_1, ..., I_n. First, match between I_0 and I_1. Now, assume q_0 is a 2d point from I_0 that has successfully been matched to q_1 from I_1, which corresponds to some 3d point Q. Now, to ensure consistency, consider the reprojection of Q in I_2, and call it q_2. Match I_0 and I_2. Does the point to which q_0 is matched in I_2 fall close to q_2? If yes, keep the 2d/3d match between q_0 and Q, and so on.
I don't have enough information about your data and your application, but I think that depending on your constraints (real-time or not, etc), you could come up with some variation of the above. The key idea anyway is, as I said previously, to try to match from frame to frame and then propagate to the 3d case.
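To make this slightly more concrete, here is a rough, heavily simplified illustration of the consistency check described above. It is not the code of any particular system; every name (the 3d point, the pose and intrinsics of I_2, the pixel tolerance) is a hypothetical input that your own SFM pipeline would have to provide:
#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>   // cv::projectPoints
#include <cmath>
#include <vector>

// Hypothetical inputs: the 3d point Q associated with the match (q_0, q_1),
// the pose (rvec_2, tvec_2) and intrinsics K of image I_2,
// and the point to which q_0 was matched in I_2 by a second 2d/2d matching pass.
bool isConsistent(const cv::Point3f& Q,
                  const cv::Mat& rvec_2, const cv::Mat& tvec_2,
                  const cv::Mat& K, const cv::Mat& distCoeffs,
                  const cv::Point2f& q0_matched_in_I2,
                  float pixelTolerance = 3.0f)
{
    // Reproject Q into I_2 -> q_2
    std::vector<cv::Point3f> object(1, Q);
    std::vector<cv::Point2f> projected;
    cv::projectPoints(object, rvec_2, tvec_2, K, distCoeffs, projected);
    const cv::Point2f q_2 = projected[0];

    // Keep the 2d/3d correspondence (q_0, Q) only if the independent
    // I_0 <-> I_2 match lands close to the reprojection q_2.
    const float dx = q_2.x - q0_matched_in_I2.x;
    const float dy = q_2.y - q0_matched_in_I2.y;
    return std::hypot(dx, dy) < pixelTolerance;
}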
Edit: Thank you for your clarifications in the comments. Here are a few thoughts (feel free to correct me):
Let's consider a SIFT descriptor s_0 from I_0, and let's note F(s_1,...,s_n) your aggregated descriptor (that can be an average or a concatenation of the SIFT descriptors s_i in their corresponding I_i, etc). Then when matching s_0 with F, you will only want to use a subset of the s_i that belong to images that have close viewpoints to I_0 (because of the 30deg problem that you mention, although I think it should be 50deg). That means that you have to attribute a weight to each s_i that depends on the pose of your query I_0. You obviously can't do that when constructing F, so you have to do it when matching. However, you don't have a strong prior on the pose (otherwise, I assume you wouldn't be needing PnP). As a result, you can't really determine this weight. Therefore I think there are two conclusions/options here:
SIFT descriptors are not adapted to the task. You can try coming up with a perspective-invariant descriptor. There is some literature on the subject.
Try to keep some visual information in the form of "Key-frames", as in many SLAM systems. It wouldn't make sense to keep all of your images anyway, just keep a few that are well distributed (pose-wise) in each area, and use those to propagate 2d matches to the 3d case.
If you only match between the 2d point of your query and 3d descriptors without any form of consistency check (as the one I proposed earlier), you will introduce a lot of noise...
tl;dr I would keep some images.
1 Since you say that you obtain your 3d reconstruction from an SFM pipeline, some of the points are probably considered inliers and some outliers (indicated by a boolean flag). If they are outliers, just ignore them; if they are inliers, then they are the result of matching and triangulation, and their position has been refined multiple times, so you can trust any of their descriptors.

opencv c++ compare keypoint locations in different images

When comparing 2 images via feature extraction, how do you compare keypoint distances so to disregard those that are obviously incorrect?
I've found that when comparing similar images against each other, most of the time it can be fairly accurate, but other times it can throw matches that are completely separate.
So I'm after a way of looking at the 2 sets of keypoints from both images and determining whether the matched keypoints are relatively in the same locations on both. As in it knows that keypoints 1, 2, and 3 are so far apart on image 1, so the corresponding keypoints matched on image 2 should be of a fairly similar distance away from each other again.
I've used RANSAC and minimum distance checks in the past but only to some effect, they don't seem to be as thorough as I'm after.
(Using ORB and BruteForce)
EDIT
Changed "x, y, and z" to "1, 2, and 3"
EDIT 2 -- I'll try to explain further with quick Paint made examples:
Say I have this as my image:
And I give it this image to compare against:
It's a cropped and squashed version of the original, but obviously similar.
Now, say you ran it through feature detection and it came back with these results for the keypoints for the two images:
The keypoints on both images are in roughly the same areas, and proportionately the same distance away from each other. Take the keypoint I've circled; let's call it "Image 1 Keypoint 1".
We can see that there are 5 keypoints around it. It's these distances between them and "Image 1 Keypoint 1" that I want to obtain, so as to compare them against "Image 2 Keypoint 1" and its 5 surrounding keypoints in the same area (see below), so as to not just compare a keypoint to another keypoint, but to compare "known shapes" based off of the locations of the keypoints.
--
Does that make sense?
Keypoint matching is a problem with several dimensions. These dimensions are:
spatial distance, ie, the (x,y) distance as measured from the locations of two keypoints in different images
feature distance, that is, a distance that describes how much two keypoints look alike.
Depending on your context, you do not want to compute the same distance, or you want to combine both. Here are some use cases:
optical flow, as implemented by opencv's sparse Lucas-Kanade optical flow. In this case, keypoints called good features are computed in each frame, then matched on a spatial distance basis. This works because the image is supposed to change relatively slowly (the input frames have a video framerate);
image stitching, as you can implement from opencv's features2d (free or non-free). In this case, the images change radically since you move your camera around. Then, your goal becomes to find stable points, ie, points that are present in two or more images whatever their location is. In this case you will use feature distance. This also holds when you have a template image of an object that you want to find in query images.
In order to compute feature distance, you need to compute a coded version of their appearance. This operation is performed by the DescriptorExtractor class.
Then, you can compute distances between the output of the descriptions: if the distance between two descriptions is small then the original keypoints are very likely to correspond to the same scene point.
Pay attention when you compute distances to use the correct distance function: ORB, FREAK and BRISK rely on the Hamming distance, while SIFT and SURF use the more usual L2 distance.
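As a small illustration (plain brute-force matching assumed), the distance function is chosen when the matcher is constructed:
#include <opencv2/features2d.hpp>

// Binary descriptors (ORB, BRISK, FREAK) -> Hamming distance
cv::BFMatcher orbMatcher(cv::NORM_HAMMING);

// Float descriptors (SIFT, SURF) -> L2 distance
cv::BFMatcher siftMatcher(cv::NORM_L2);
(If you prefer FlannBasedMatcher with binary descriptors, it has to be constructed with cv::flann::LshIndexParams instead of its default KD-tree parameters.)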
Match filtering
When you have individual matches, you may want to perform match filtering in order to reject individually good matches that may arise from scene ambiguities. Think for example of a keypoint that originates from the corner of a window of a house. It is very likely to match with another window in another house, but this may not be the right house or the right window.
You have several ways of doing it:
RANSAC performs a consistency check of the computed matches with the current solution estimate. Basically, it picks some matches at random, computes a solution to the problem (usually a geometric transform between 2 images) and then counts how many of the matches agree with this estimate. The estimate with the highest count of inliers wins;
David Lowe performed another kind of filtering in the original SIFT paper.
He kept the two best candidates for a match with a given query keypoint, ie, points that had the lowest distance (or highest similarity). Then, he computed the ratio similarity(query, best)/similarity(query, 2nd best). If this ratio is too low then the second best is also a good candidate for a match, and the matching result is dubbed ambiguous and rejected.
Exactly how you should do it in your case is thus very likely to depend on your exact application.
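For instance, when the two images are (approximately) related by a homography, a sketch of RANSAC-based filtering with OpenCV could look like this; the keypoint and match variables are placeholders for your own data:
#include <opencv2/calib3d.hpp>    // cv::findHomography
#include <opencv2/features2d.hpp>
#include <vector>

// matches: result of matching image 1 against image 2 (e.g. after a ratio test)
// kp1, kp2: the keypoints of the two images
std::vector<cv::DMatch> ransacFilter(const std::vector<cv::DMatch>& matches,
                                     const std::vector<cv::KeyPoint>& kp1,
                                     const std::vector<cv::KeyPoint>& kp2)
{
    std::vector<cv::DMatch> inliers;
    if (matches.size() < 4) return inliers;        // findHomography needs at least 4 points

    std::vector<cv::Point2f> pts1, pts2;
    for (const cv::DMatch& m : matches)
    {
        pts1.push_back(kp1[m.queryIdx].pt);
        pts2.push_back(kp2[m.trainIdx].pt);
    }

    // RANSAC estimates a homography and marks the matches that agree with it
    std::vector<uchar> inlierMask;
    cv::findHomography(pts1, pts2, cv::RANSAC, 3.0, inlierMask);

    for (size_t i = 0; i < matches.size() && i < inlierMask.size(); ++i)
        if (inlierMask[i])
            inliers.push_back(matches[i]);
    return inliers;
}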
Your specific case
In your case, you want to develop an alternate feature descriptor that is based on neighbouring keypoints.
The sky is obviously the limit here, but here are some steps that I would follow:
make your descriptor rotation and scale invariant by computing the PCA of the keypoint locations:
// Form an N x 2 floating point matrix (CV_32F) of keypoint locations in the current image
cv::Mat allKeyPointsMatrix = gatherAllKeypoints(keypoints);
// Compute the PCA basis (one keypoint per row, keep 2 components)
cv::PCA currentPCA(allKeyPointsMatrix, cv::Mat(), cv::PCA::DATA_AS_ROW, 2);
// Reproject the keypoints into the new basis
cv::Mat normalizedKeyPoints = currentPCA.project(allKeyPointsMatrix);
(optional) sort the keypoints in a quadtree or kd-tree for faster spatial indexing
Compute for each keypoint a descriptor that is (for example) the offsets in normalized coordinates of the 4 or 5 closest keypoints
Do the same in your query image
Match keypoints from both images based on these new descriptors (see the sketch below).
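A rough sketch of steps 3 to 5, continuing from the PCA snippet above (normalizedKeyPoints holds one (x, y) row per keypoint). The choice of 4 neighbours and the plain O(n^2) neighbour search are arbitrary simplifications for illustration, not a recommendation:
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <algorithm>
#include <cmath>
#include <vector>

// Build, for each keypoint, a descriptor made of the offsets (in the PCA-normalized
// frame) to its k closest neighbouring keypoints.
cv::Mat neighbourOffsetDescriptors(const cv::Mat& normalizedKeyPoints, int k = 4)
{
    const int n = normalizedKeyPoints.rows;          // one (x, y) row per keypoint, CV_32F
    cv::Mat descriptors(n, 2 * k, CV_32F, cv::Scalar(0));

    for (int i = 0; i < n; ++i)
    {
        const cv::Point2f p(normalizedKeyPoints.at<float>(i, 0),
                            normalizedKeyPoints.at<float>(i, 1));

        // Collect (distance, index) pairs to all other keypoints (brute force)
        std::vector<std::pair<float, int>> dists;
        for (int j = 0; j < n; ++j)
        {
            if (j == i) continue;
            const cv::Point2f q(normalizedKeyPoints.at<float>(j, 0),
                                normalizedKeyPoints.at<float>(j, 1));
            dists.push_back({ (float)std::hypot(q.x - p.x, q.y - p.y), j });
        }
        std::sort(dists.begin(), dists.end());

        // Descriptor = offsets to the k nearest neighbours, nearest first
        for (int m = 0; m < k && m < (int)dists.size(); ++m)
        {
            const int j = dists[m].second;
            descriptors.at<float>(i, 2 * m)     = normalizedKeyPoints.at<float>(j, 0) - p.x;
            descriptors.at<float>(i, 2 * m + 1) = normalizedKeyPoints.at<float>(j, 1) - p.y;
        }
    }
    return descriptors;
}

// Step 5: match these descriptors between the two images with an L2 brute-force matcher, e.g.
// cv::BFMatcher matcher(cv::NORM_L2);
// matcher.match(descriptorsImage1, descriptorsImage2, matches);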
What is it you are trying to do exactly? More information is needed to give you a good answer. Otherwise it'll have to be very broad and most likely not useful to your needs.
And with your statement "determining whether the matched keypoints are relatively in the same locations on both" do you mean literally on the same x,y positions between 2 images?
I would try out the SURF algorithm. It works extremely well for what you described above (though I found it to be a bit slow unless you use GPU acceleration: 5 fps vs. 34 fps).
Here is the tutorial for SURF; I personally found it very useful, but the prebuilt executables only target one platform. However, you can simply remove the OS-specific bindings in the source code, keep only the OpenCV-related parts, and have it compile and run on your platform just the same.
https://code.google.com/p/find-object/#Tutorials
Hope this helped!
You can filter on the pixel distance between two matched keypoints.
Let's say matches is your vector of matches, kp_1 your vector of keypoints on the first picture and kp_2 on the second. You can use the code below to eliminate obviously incorrect matches. You just need to fix a threshold.
double threshold = YourValue;   // maximum allowed pixel distance between matched keypoints
vector<DMatch> good_matches;
for (size_t i = 0; i < matches.size(); i++)
{
    const DMatch& m = matches[i][0];              // best (1-NN) match of the i-th query keypoint
    const Point2f& p1 = kp_1[m.queryIdx].pt;      // location in the first picture
    const Point2f& p2 = kp_2[m.trainIdx].pt;      // location in the second picture
    double dist_p = std::hypot(p1.x - p2.x, p1.y - p2.y);   // Euclidean pixel distance (needs <cmath>)
    if (dist_p < threshold)
    {
        good_matches.push_back(m);
    }
}

C++ - Using Bag of Words for matching pictures together?

I would like to compare a picture (with its descriptors) with thousands of pictures inside a database in order to do a matching (if two pictures show the same thing, even if one is rotated, a bit blurred, has a different scale, etc.).
For example:
I saw on Stack Overflow that computing descriptors for each picture and comparing them one to one is a very long process.
I did some research and saw that I could use an algorithm based on Bag of Words.
I don't know exactly how it works yet, but it seems to be good. But I think, though I may be mistaken, that it is only for detecting what kind of object something is, no?
I would like to know whether, in your opinion, using it can be a good solution to compare a picture to thousands of pictures using descriptors like SIFT or SURF?
If yes, do you have some suggestions about how I can do that?
Thanks,
Yes, it is possible. The only thing you have to pay attention to is the computational requirement, which can be a little overwhelming. If you can narrow the search, that usually helps.
To support my answer I will extract some examples from a recent work of ours. We aimed at recognizing a painting on a museum's wall using SIFT + RANSAC matching. We have a database of all the paintings in the museum and a SIFT descriptor for each one of them. We aim at recognizing the painting in a video which can be recorded from a different perspective (all the templates are frontal) or under different lighting conditions. This image should give you an idea: on the left you can see the template and the current frame. The second image is the SIFT matching and the third shows the results after RANSAC.
Once you have the matching between your image and each SIFT descriptor set in your database, you can compute a matching score, namely the ratio between matched points that survive RANSAC and the total number of keypoints. This can be repeated for each image, and the image with the highest score (i.e. the most geometrically consistent matches) can be declared as the match.
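A compressed sketch of that scoring loop, with invented names for the precomputed keypoints and descriptors of the database (dbKeypoints, dbDescriptors); the minimum score is likewise something you would tune:
#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>
#include <opencv2/features2d.hpp>
#include <vector>

// Returns the index of the best-matching database image, or -1 if none passes minScore.
// All inputs are precomputed SIFT keypoints/descriptors; the names are placeholders.
int bestMatchingImage(const std::vector<cv::KeyPoint>& queryKp, const cv::Mat& queryDesc,
                      const std::vector<std::vector<cv::KeyPoint>>& dbKeypoints,
                      const std::vector<cv::Mat>& dbDescriptors,
                      double minScore = 0.1)
{
    cv::BFMatcher matcher(cv::NORM_L2);
    int bestImage = -1;
    double bestScore = minScore;

    for (size_t i = 0; i < dbDescriptors.size(); ++i)
    {
        if (dbDescriptors[i].empty()) continue;
        std::vector<cv::DMatch> matches;
        matcher.match(queryDesc, dbDescriptors[i], matches);
        if (matches.size() < 4) continue;             // findHomography needs >= 4 points

        std::vector<cv::Point2f> q, d;
        for (const cv::DMatch& m : matches)
        {
            q.push_back(queryKp[m.queryIdx].pt);
            d.push_back(dbKeypoints[i][m.trainIdx].pt);
        }

        std::vector<uchar> inliers;
        cv::findHomography(q, d, cv::RANSAC, 3.0, inliers);
        if (inliers.empty()) continue;

        // Score = fraction of RANSAC inliers relative to the query keypoints
        const double score = (double)cv::countNonZero(inliers) / (double)queryKp.size();
        if (score > bestScore) { bestScore = score; bestImage = (int)i; }
    }
    return bestImage;
}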
We used this for paintings but I think that can be generalized for every kind of image (the android logo you posted in the question is a fair example i think).
Hope this helps!

Measure of accuracy in pattern recognition using SURF in OpenCV

I'm currently working on pattern recognition using SURF in OpenCV. What I have so far: I've written a program in C# where I can select a source image and a template which I want to find. After that I transfer both pictures into a C++ DLL where I've implemented a program using the OpenCV SURF detector, which returns all the keypoints and matches back to my C# program, where I try to draw a rectangle around my matches.
Now my question: Is there a common measure of accuracy in pattern recognition? Like for example number of matches in proportion to the number of keypoints in the template? Or maybe the size-difference between my match-rectangle and the original size of the template-image? What are common parameters that are used to say if a match is a “real” and “good” match?
Edit: To make my question clearer. I have a bunch of matchpoints, that are already thresholded by minHessian and distance value. After that I draw something like a rectangle around my matchpoints as you can see in my picture. This is my MATCH. How can I tell now how good this match is? I'm already calculating angle, size and color differences between my now found match and my template. But I think that is much too vague.
I am not 100% sure about what you are really asking, because what you call a "match" is vague. But since you said you already matched your SURF points and mentioned pattern recognition and the use of a template, I am assuming that, ultimately, you want to localize the template in your image and you are asking about a localization score to decide whether you found the template in the image or not.
This is a challenging problem and I am not aware that a good and always-appropriate solution has been found yet.
However, given your approach, what you could do is analyze the density of matched points in your image: consider local or global maxima as possible locations for your template (global if you know your template appears only once in the image, local if it can appear multiple times) and use a threshold on the density to decide whether or not the template appears. A sketch of the algorithm could be something like this (a code sketch follows after the list):
Allocate a floating point density map of the size of your image
Compute the density map, by increasing by a fixed amount the density map in the neighborhood of each matched point (for instance, for each matched point, add a fixed value epsilon in the rectangle you are displaying in your question)
Find the global or local maxima of the density map (the global maximum can be found using the OpenCV function minMaxLoc, and local maxima can be found using morphological operations, e.g. How can I find local maxima in an image in MATLAB?)
For each maxima obtained, compare the corresponding density value to a threshold tau, to decide whether your template is there or not
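The sketch announced above, written out with hypothetical parameters (rectangle size, epsilon, tau) that you would have to tune; it only handles the global-maximum case:
#include <opencv2/core.hpp>
#include <vector>

// Build a density map by adding a fixed value epsilon around each matched point,
// then check whether the global maximum exceeds a threshold tau.
bool templateIsPresent(const std::vector<cv::Point2f>& matchedPoints,
                       const cv::Size& imageSize,
                       int rectSize = 40, float epsilon = 1.0f, float tau = 10.0f,
                       cv::Point* location = nullptr)
{
    cv::Mat density(imageSize, CV_32F, cv::Scalar(0));

    for (const cv::Point2f& p : matchedPoints)
    {
        // Neighbourhood rectangle around the matched point, clipped to the image borders
        cv::Rect roi(cv::Point((int)p.x - rectSize / 2, (int)p.y - rectSize / 2),
                     cv::Size(rectSize, rectSize));
        roi &= cv::Rect(0, 0, imageSize.width, imageSize.height);
        cv::Mat roiView = density(roi);
        roiView += epsilon;
    }

    // Global maximum of the density map (for local maxima, see the morphology trick above)
    double maxVal;
    cv::Point maxLoc;
    cv::minMaxLoc(density, nullptr, &maxVal, nullptr, &maxLoc);

    if (location) *location = maxLoc;
    return maxVal >= tau;
}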
If you are into research articles, you can check the following ones for improvements of this basic algorithm:
"ADABOOST WITH KEYPOINT PRESENCE FEATURES FOR REAL-TIME VEHICLE VISUAL DETECTION", by T.Bdiri, F.Moutarde, N.Bourdis and B.Steux, 2009.
"Interleaving Object Categorization and Segmentation", by B.Leibe and B.Schiele, 2006.
EDIT: another way to address your problem is to try and remove accidentally matched points in order to keep only those truly corresponding to your template image. This can be done by enforcing a constraint of consistency between close matched points. The following research article presents an approach like this: "Context-dependent logo matching and retrieval", by H.Sahbi, L.Ballan, G.Serra, A.Del Bimbo, 2010 (however, this may require some background knowledge...).
Hope this helps.
Well, when you compare points you use some metric, so the result of a comparison is a distance, and the smaller this distance, the better.
Example of code:
BFMatcher matcher(NORM_L2, true);   // L2 norm for SIFT/SURF descriptors, cross-check enabled
vector<DMatch> matches;
matcher.match(descriptors1, descriptors2, matches);
// drop every match whose descriptor distance is too large (needs <algorithm>)
matches.erase(std::remove_if(matches.begin(), matches.end(), bad_dist), matches.end());
where bad_dist is defined as
bool bad_dist(const DMatch &m) {
    return m.distance > 150;   // descriptor-distance threshold, tune it for your data
}
In this code I get rid of the 'bad' matches.
There are many ways to match two patterns in the same image, actually it's a very open topic in computer vision, because there isn't a global best solution.
For instance, if you know your object can appear rotated (I'm not familiar with SURF, but I guess the descriptors are rotation invariant like SIFT descriptors), you can estimate the rotation between the pattern you have in the training set and the pattern you just matched with. A match with the minimum error will be a better match.
I recommend you consult Computer Vision: Algorithms and Applications. There's no code in it, but lots of useful techniques typically used in computer vision (most of them already implemented in opencv).

Is there a feature extraction method that is scale invariant but not rotation invariant?

Is there a feature extraction method that is scale invariant but not rotation invariant? I would like to match similar images that have been scaled but not rotated...
EDIT: Let me rephrase. How can I check whether an image is a scaled version (or close to) of an original?
Histograms and Gaussian pyramids are used to extract scale-invariant features.
How can I check whether an image is a scaled version (or close to) of an original?
That is a puzzle for me. Do you mean that, given two images, one is the original and the other is a scaled version of it? Or is one the original and the other a fragment of it, but scaled, and you want to locate the fragment in the original?
[updated]
Given two images, a and b,
Detect their SIFT or SURF feature points and descriptors.
Get the corresponding regions between a and b. If none, return false. Refer to Pattern Matching - Find reference object in second image and Trying to match two images using sift in OpenCv, but too many matches. Name the region in a as Ra, and one in b as Rb.
Use an algorithm like template matching to determine whether Ra is similar enough to Rb. If yes, calculate the scale ratio (a code sketch follows below).
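A possible sketch of steps 1 to 3 (OpenCV 3.2 or later is assumed for estimateAffinePartial2D, and the descriptors are assumed to come from SIFT or SURF; the ratio-test factor and RANSAC threshold are placeholders). Since a similarity transform also tolerates rotation, you can additionally check that the recovered rotation angle is close to zero if rotation should be excluded:
#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>     // cv::estimateAffinePartial2D
#include <opencv2/features2d.hpp>
#include <cmath>
#include <vector>

// Given keypoints and descriptors of images a and b, estimate the scale of b relative to a.
// Returns a negative value if no reliable similarity transform is found.
double estimateScaleRatio(const std::vector<cv::KeyPoint>& kpA, const cv::Mat& descA,
                          const std::vector<cv::KeyPoint>& kpB, const cv::Mat& descB)
{
    // 1) Match descriptors (L2 is appropriate for SIFT/SURF) with a ratio test
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(descA, descB, knn, 2);

    std::vector<cv::Point2f> ptsA, ptsB;
    for (const auto& pair : knn)
        if (pair.size() == 2 && pair[0].distance < 0.75f * pair[1].distance)
        {
            ptsA.push_back(kpA[pair[0].queryIdx].pt);
            ptsB.push_back(kpB[pair[0].trainIdx].pt);
        }
    if (ptsA.size() < 4) return -1.0;

    // 2) Fit a similarity transform (rotation + uniform scale + translation) with RANSAC
    std::vector<uchar> inliers;
    cv::Mat M = cv::estimateAffinePartial2D(ptsA, ptsB, inliers, cv::RANSAC, 3.0);
    if (M.empty() || cv::countNonZero(inliers) < 4) return -1.0;

    // 3) The scale is the norm of the first column of the 2x2 linear part
    const double a = M.at<double>(0, 0), b = M.at<double>(1, 0);
    return std::hypot(a, b);
}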