Best algorithm for video stabilization - c++

I am creating a program to stabilize a video stream. At the moment, my program is based on the phase correlation algorithm: I calculate the offset between two images (the base frame and the current frame), then I shift the current image according to the new coordinates. The program works, but the result is not satisfactory. As you can see in the linked videos, the processed video still shakes noticeably, and in places the overall shake even gets worse.
Original video
Stabilized video
Here is my current implementation:
Calculating the offset between images:
Point2d calculate_offset_phase_optimized(Mat one, Mat& two) {
    if (two.type() != CV_64F) {
        cvtColor(two, two, CV_BGR2GRAY);
        two.convertTo(two, CV_64F);
    }
    cvtColor(one, one, CV_BGR2GRAY);
    one.convertTo(one, CV_64F);
    return phaseCorrelate(one, two);
}
Shifting the image according to this offset:
void move_image_roi_alt(Mat& img, Mat& trans, const Point2d& offset) {
    trans = Mat::zeros(img.size(), img.type());
    img(
        Rect(
            _0(static_cast<int>(offset.x)),
            _0(static_cast<int>(offset.y)),
            img.cols - abs(static_cast<int>(offset.x)),
            img.rows - abs(static_cast<int>(offset.y))
        )
    ).copyTo(trans(
        Rect(
            _0ia(static_cast<int>(offset.x)),
            _0ia(static_cast<int>(offset.y)),
            img.cols - abs(static_cast<int>(offset.x)),
            img.rows - abs(static_cast<int>(offset.y))
        )
    ));
}

int _0(const int x) {
    return x < 0 ? 0 : x;
}

int _0ia(const int x) {
    return x < 0 ? abs(x) : 0;
}
I have also looked through the paper by the authors of the YouTube stabilizer, and their algorithm based on corner detection seemed attractive, but I'm not entirely clear on how it works.
So my question is: how can this problem be solved effectively?
One of the constraints is that the program will run on slower computers, so heavy algorithms may not be suitable.
Thanks!
P.S.
I apologize for any mistakes in the text - it is an automatic translation.

You can use image descriptors such as SIFT in each frame and calculate robust matches between the frames. Then you can calculate a homography between the frames and use that to align them. Using sparse features can lead to a faster implementation than dense correlation.
Alternatively, if you know the camera parameters you can calculate 3D positions of the points and of the cameras and reproject the images onto a stable projection plane. As a result, you also get a sparse 3D reconstruction of the scene (somewhat imprecise, usually it needs to be optimized to be usable). This is what e.g. Autostitch would do, but it is quite difficult to implement.
Note that the camera parameters can also be calculated, but that is even more difficult.
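For illustration, here is a minimal sketch of the sparse-feature approach, assuming OpenCV 4.x where SIFT lives in the main module (in 2.x/3.x it is in the nonfree/xfeatures2d module); the ratio-test threshold and RANSAC parameters are arbitrary choices:

#include <opencv2/opencv.hpp>
#include <vector>

// Align `frame` to `reference` with a homography estimated from SIFT matches.
cv::Mat align_with_sift(const cv::Mat& reference, const cv::Mat& frame)
{
    cv::Ptr<cv::SIFT> sift = cv::SIFT::create();

    std::vector<cv::KeyPoint> kp1, kp2;
    cv::Mat desc1, desc2;
    sift->detectAndCompute(reference, cv::noArray(), kp1, desc1);
    sift->detectAndCompute(frame, cv::noArray(), kp2, desc2);

    // Ratio-test filtering of knn matches keeps only distinctive correspondences.
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(desc2, desc1, knn, 2);

    std::vector<cv::Point2f> src, dst;
    for (const auto& m : knn) {
        if (m.size() == 2 && m[0].distance < 0.75f * m[1].distance) {
            src.push_back(kp2[m[0].queryIdx].pt);   // point in the current frame
            dst.push_back(kp1[m[0].trainIdx].pt);   // matching point in the reference
        }
    }
    if (src.size() < 4)
        return frame.clone();  // not enough matches to estimate a homography

    // Robust homography estimation, then warp the frame onto the reference.
    cv::Mat H = cv::findHomography(src, dst, cv::RANSAC, 3.0);
    cv::Mat aligned;
    cv::warpPerspective(frame, aligned, H, reference.size());
    return aligned;
}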

OpenCV can do it for you in 3 lines of code (it is definitely the shortest way, and may even be the best):
t = estimateRigidTransform(newFrame, referenceFrame, 0); // 0 means not all transformations (5 of 6)
if (!t.empty()) {
    warpAffine(newFrame, stableFrame, t, Size(newFrame.cols, newFrame.rows)); // stableFrame should be stable now
}
You can suppress certain kinds of transformations by modifying the matrix t, which can lead to a more stable result. This is just the core idea; you can then modify it any way you want: change the referenceFrame, smooth the sequence of transformation parameters taken from matrix t, etc.
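As a hypothetical sketch of that smoothing step (applied inside the !t.empty() branch), you could keep a sliding window of the per-frame motion extracted from each 2x3 matrix and warp with the averaged motion instead; the MotionSmoother name and the window size are illustrative assumptions, not part of any OpenCV API:

#include <opencv2/opencv.hpp>
#include <deque>
#include <cmath>

// Low-pass filter over the (dx, dy, rotation) parameters of the matrices
// returned by estimateRigidTransform.
struct MotionSmoother {
    std::deque<cv::Vec3d> window;   // (dx, dy, da) per frame
    size_t radius = 15;             // smoothing window size (assumed value)

    cv::Mat smooth(const cv::Mat& t) {
        double dx = t.at<double>(0, 2);
        double dy = t.at<double>(1, 2);
        double da = std::atan2(t.at<double>(1, 0), t.at<double>(0, 0));
        window.push_back(cv::Vec3d(dx, dy, da));
        if (window.size() > radius) window.pop_front();

        cv::Vec3d mean(0, 0, 0);
        for (const auto& v : window) mean += v;
        mean *= 1.0 / window.size();

        // Rebuild a rigid 2x3 matrix (rotation + translation only, scale dropped)
        // from the averaged parameters.
        cv::Mat s = (cv::Mat_<double>(2, 3) <<
            std::cos(mean[2]), -std::sin(mean[2]), mean[0],
            std::sin(mean[2]),  std::cos(mean[2]), mean[1]);
        return s;
    }
};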

Related

What is the correct way to average several ARUCO rotation vectors

I am writing an Eye-on-Hand calibration program.
To do that, I am moving the camera mounted on the robot arm to 20 different positions, looking at a single aruco marker.
The translation vector is very stable, but the rotation axes flicker, introducing an error into the resulting calibration matrix.
Therefore, I would like to average the rotation vectors over X frames (the aruco library returns rotation vectors and translation vectors separately).
Here is the important part of the code
cv::aruco::detectMarkers(image, dictionary, markerCorners, markerIds, parameters, rejectedCandidates);
outputImage = image.clone();
cv::aruco::drawDetectedMarkers(outputImage, markerCorners, markerIds);
cv::aruco::estimatePoseSingleMarkers(markerCorners, 0.05, camMatrix, distCoeffs, rvecs, tvecs);
rvecs is actually a vector of rotation vectors, with only one member because there is only one aruco marker.
If a marker is found in the frame then,
if (rvecs.size() == 1) { // exactly one marker found, good frame
    framesFound++;
    for (int i = 0; i < 3; i++) {
        avgRvecs[i] += rvecs[0][i];
        avgTvecs[i] += tvecs[0][i];
    }
}
And after all the desired frames to average have been processed,
if (framesFound == 0) {
    // No frames with markers...
} else {
    for (int i = 0; i < 3; i++) {
        avgRvecs[i] = avgRvecs[i] / framesFound;
        avgTvecs[i] = avgTvecs[i] / framesFound;
    }
    cv::drawFrameAxes(outputImage, camMatrix, distCoeffs, avgRvecs, avgTvecs, 0.1);
}
With a single frame I get
With 10 averaged frames I get
Because the pose estimation of Aruco markers is usually done with IPPE (the algorithm under the hood of solvePnP), you might have some "singularities" in the rotation results.
Averaging the rotations can be a good solution as it works as a low-pass filter, though you need to remember that if you are seeking precision, this might not be the most appropriate filter to use.
Most of the time, we like to manipulate Euler angles, direction cosine matrices, or Rodrigues vectors. Unfortunately, none of those is the ideal representation for averaging angular values. If you want to manipulate angles without mathematical errors, I definitely recommend having a look at quaternions.
Some existing posts suggest interesting approaches with quaternions:
"Average" of multiple quaternions?
https://math.stackexchange.com/questions/1984608/average-of-3d-rotations
Math can be a little scary, but the posts come with well-written code examples.
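As a rough sketch of that idea (assuming OpenCV >= 4.5, which provides cv::Quatd in opencv2/core/quaternion.hpp), you could collect the per-frame rotation vectors as cv::Vec3d, convert each to a quaternion, align their signs, and normalize the sum; for rotations that are close together this behaves like a proper average:

#include <opencv2/opencv.hpp>
#include <opencv2/core/quaternion.hpp>   // cv::Quatd, OpenCV >= 4.5 assumed
#include <vector>

// Average a set of rotation vectors (as returned per frame by
// estimatePoseSingleMarkers) by going through quaternions.
cv::Vec3d average_rvecs(const std::vector<cv::Vec3d>& rvecs)
{
    cv::Quatd sum(0, 0, 0, 0);
    cv::Quatd first;
    for (size_t i = 0; i < rvecs.size(); ++i) {
        cv::Mat R;
        cv::Rodrigues(cv::Mat(rvecs[i]), R);
        cv::Quatd q = cv::Quatd::createFromRotMat(R);
        if (i == 0) first = q;
        // q and -q encode the same rotation; flip the sign to keep them consistent.
        if (q.dot(first) < 0) q = -q;
        sum = sum + q;
    }
    cv::Quatd avg = sum.normalize();   // normalized sum ~= chordal mean for nearby rotations

    // Convert back to a rotation vector for drawFrameAxes etc.
    cv::Mat Ravg(avg.toRotMat3x3());
    cv::Mat rvecMat;
    cv::Rodrigues(Ravg, rvecMat);
    return cv::Vec3d(rvecMat.at<double>(0), rvecMat.at<double>(1), rvecMat.at<double>(2));
}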

OpenCV Dense feature detector

I am using OpenCV to do some dense feature extraction. For example, the code
DenseFeatureDetector detector(12.f, 1, 0.1f, 10);
I don't really understand the parameters in the above constructor. What do they mean? Reading the OpenCV documentation about it does not help much either. In the documentation the arguments are:
DenseFeatureDetector( float initFeatureScale=1.f, int featureScaleLevels=1,
                      float featureScaleMul=0.1f,
                      int initXyStep=6, int initImgBound=0,
                      bool varyXyStepWithScale=true,
                      bool varyImgBoundWithScale=false );
What are they supposed to do? I.e., what is the meaning of scale, initFeatureScale, featureScaleLevels, etc.? How do you control the grid or grid spacing for the dense sampling?
I'm using OpenCV's dense detector too, and I think I can help you with something. I'm not completely sure about what follows, but it's what experience has taught me.
When I use the dense detector I pass it the grayscale image. The detector applies a series of threshold filters: OpenCV uses a minimum gray value to transform the image, so pixels whose gray level is above the threshold become black points and the others white points. This is repeated in a loop in which the threshold gets bigger and bigger. The parameter initFeatureScale determines the first threshold used in this loop, featureScaleLevels indicates how much bigger the threshold gets from one loop iteration to the next, and featureScaleMul is the multiplication factor used to compute the next threshold.
Anyway, if you are looking for the optimal parameters for the dense detector to detect your particular points, I can offer a program I made for that. It is released on GitHub. It lets you test several detectors (the dense detector is one of them) and check how they behave when you change their parameters, thanks to a user interface that lets you change the detector parameters while the program is running. You will see how the detected points change. To try it, just click on the link and download the files; you will need almost all of them to run the program.
Apologies in advance, I'm predominantly using Python so I'll avoid embarrassing myself by referring to C++.
DenseFeatureDetector populates a vector with KeyPoints to pass to compute feature descriptors. These keypoints have a point vector and their scale set. In the documentation, scale is the pixel radius of the keypoint.
KeyPoints are evenly spaced across the width and height of the image matrix passed to DenseFeatureDetector.
Now to the arguments:
initFeatureScale
Set the initial KeyPoint feature radius in pixels (as far as I am aware this has no effect)
featureScaleLevels
Number of scales over which we wish to make keypoints
featureScaleMul
Scale adjustment for initFeatureScale over featureScaleLevels; this scale adjustment can also be applied to the border (initImgBound) and the step size (initXyStep). So when we set featureScaleLevels > 1, this multiplier will be applied to successive scales, to adjust feature scale, step and the boundary around the image.
initXyStep
moving column and row step in pixels. Self explanatory I hope.
initImgBound
row/col bounding region to ignore around the image (pixels), So a 100x100 image, with an initImgBound of 10, would create keypoints in the central 80x80 portion of the image.
varyXyStepWithScale
Boolean; if we have multiple featureScaleLevels, do we want to adjust the step size using featureScaleMul.
varyImgBoundWithScale
Boolean, as varyXyStepWithScale, but applied to the border.
Here is the DenseFeatureDetector source code from detectors.cpp in the OpenCV 2.4.3 source, which will probably explain better than my words:
DenseFeatureDetector::DenseFeatureDetector( float _initFeatureScale, int _featureScaleLevels,
                                            float _featureScaleMul, int _initXyStep,
                                            int _initImgBound, bool _varyXyStepWithScale,
                                            bool _varyImgBoundWithScale ) :
    initFeatureScale(_initFeatureScale), featureScaleLevels(_featureScaleLevels),
    featureScaleMul(_featureScaleMul), initXyStep(_initXyStep), initImgBound(_initImgBound),
    varyXyStepWithScale(_varyXyStepWithScale), varyImgBoundWithScale(_varyImgBoundWithScale)
{}

void DenseFeatureDetector::detectImpl( const Mat& image, vector<KeyPoint>& keypoints, const Mat& mask ) const
{
    float curScale = static_cast<float>(initFeatureScale);
    int curStep = initXyStep;
    int curBound = initImgBound;
    for( int curLevel = 0; curLevel < featureScaleLevels; curLevel++ )
    {
        for( int x = curBound; x < image.cols - curBound; x += curStep )
        {
            for( int y = curBound; y < image.rows - curBound; y += curStep )
            {
                keypoints.push_back( KeyPoint(static_cast<float>(x), static_cast<float>(y), curScale) );
            }
        }

        curScale = static_cast<float>(curScale * featureScaleMul);
        if( varyXyStepWithScale ) curStep = static_cast<int>( curStep * featureScaleMul + 0.5f );
        if( varyImgBoundWithScale ) curBound = static_cast<int>( curBound * featureScaleMul + 0.5f );
    }

    KeyPointsFilter::runByPixelsMask( keypoints, mask );
}
You might expect a call to compute to calculate additional KeyPoint characteristics using the relevant keypoint detection algorithm (e.g. angle), based on the KeyPoints generated by DenseFeatureDetector. Unfortunately this isn't the case for SIFT under Python; I've not looked at the other feature detectors, nor at the behaviour in C++.
Also note that DenseFeatureDetector is not in OpenCV 3.2 (unsure at which release it was removed).
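If you need the same behaviour on a newer OpenCV, the grid is simple to reproduce by hand; the helper below is only an illustrative sketch, and its default values are made up to loosely mirror the old constructor:

#include <opencv2/opencv.hpp>
#include <vector>

// Generate the same evenly spaced grid of keypoints that DenseFeatureDetector
// produced, for a single scale level.
std::vector<cv::KeyPoint> dense_keypoints(const cv::Mat& image,
                                          float scale = 12.f,
                                          int step = 6,
                                          int border = 0)
{
    std::vector<cv::KeyPoint> keypoints;
    for (int y = border; y < image.rows - border; y += step)
        for (int x = border; x < image.cols - border; x += step)
            keypoints.emplace_back(static_cast<float>(x), static_cast<float>(y), scale);
    return keypoints;
}

// Usage: feed the grid to any descriptor extractor, e.g. SIFT in OpenCV 4.x:
//   cv::Mat descriptors;
//   cv::SIFT::create()->compute(image, keypoints, descriptors);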

Matching small grayscale images

I want to test whether two images match. Partial matches also interest me.
The problem is that the images suffer from strong noise. Another problem is that the images might be rotated with an unknown angle. The objects shown in the images will roughly always have the same scale!
The images show area scans from a top-shot perspective. "Lines" are mostly walls and other objects are mostly trees and different kinds of plants.
Another problem was that the left image was very blurry and the right one's lines were very thin.
To compensate for this difference I used dilation. The resulting images are the ones I uploaded.
Although it can easily be seen that these images match almost perfectly, I cannot convince my algorithm of this fact.
My first idea was a feature based matching, but the matches are horrible. It only worked for a rotation angle of -90°, 0° and 90°. Although most descriptors are rotation invariant (in past projects they really were), the rotation invariance seems to fail for this example.
My second idea was to split the images into several smaller segments and to use template matching. So I segmented the images and, again, for the human eye they are pretty easy to match. The goal of this step was to segment the different walls and trees/plants.
The upper row are parts of the left, and the lower are parts of the right image. After the segmentation the segments were dilated again.
As already mentioned: Template matching failed, as did contour based template matching and contour matching.
I think the dilation of the images was very important, because it was nearly impossible for the human eye to match the segments without dilation before the segmentation. Another dilation after the segmentation made this even easier.
Your first job should be to fix the orientation. I am not sure what the best algorithm for that is, but here is an approach I would use: fix one of the images and start rotating the other. For each rotation, compute a histogram of the color intensity over each of the rows/columns. Compute some distance between the resulting vectors (e.g. use the cross product). Choose the rotation that results in the smallest cross product. It may be a good idea to combine this approach with hill climbing.
Once you have the images aligned in approximately the same direction, I believe matching should be easier. As the two images are supposed to be at the same scale, compute something analogous to the geometrical center for both images: compute a weighted sum of all pixels, where a completely white pixel has a weight of 1 and a completely black one a weight of 0; the sum should be a vector of size 2 (x and y coordinates). After that, divide those values by the dimensions of the image and call this the "geometrical center of the image". Overlay the two images so that the two centers coincide, and then once more compute the cross product for the difference between the images. I would say this should be their difference.
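As a rough sketch of that rotation search (assuming both images are the same size and 8-bit grayscale; the 1-degree step and the L2 distance between column profiles are arbitrary choices, not part of the original suggestion):

#include <opencv2/opencv.hpp>
#include <limits>

// Brute-force search for the rotation that best aligns `moving` to `fixed`,
// comparing column intensity profiles.
double find_best_rotation(const cv::Mat& fixed, const cv::Mat& moving)
{
    CV_Assert(fixed.size() == moving.size());

    cv::Mat fixedProfile;
    cv::reduce(fixed, fixedProfile, 0, cv::REDUCE_SUM, CV_64F);  // column sums of the fixed image

    double bestAngle = 0.0, bestDist = std::numeric_limits<double>::max();
    for (int angle = 0; angle < 360; ++angle) {
        cv::Mat rot = cv::getRotationMatrix2D(
            cv::Point2f(moving.cols / 2.f, moving.rows / 2.f), angle, 1.0);
        cv::Mat rotated;
        cv::warpAffine(moving, rotated, rot, moving.size());

        cv::Mat profile;
        cv::reduce(rotated, profile, 0, cv::REDUCE_SUM, CV_64F);
        double dist = cv::norm(fixedProfile, profile, cv::NORM_L2);
        if (dist < bestDist) { bestDist = dist; bestAngle = angle; }
    }
    return bestAngle;  // degrees
}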
You can also try the following methods to find the rotation and similarity.
Use image moments to get the rotation as shown here.
Once you rotate the image, use cross-correlation to evaluate the similarity.
EDIT
I tried this with OpenCV and C++ for the two sample images. I'm posting the code and results below as it seems to work well at least for the given samples.
Here's the function to calculate the orientation vector using image moments:
Mat orientVec(Mat& im)
{
    Moments m = moments(im);
    double cov[4] = {m.mu20/m.m00, m.mu11/m.m00, m.mu11/m.m00, m.mu02/m.m00};
    Mat covMat(2, 2, CV_64F, cov);
    Mat evals, evecs;
    eigen(covMat, evals, evecs);

    return evecs.row(0);
}
Rotate and match sample images:
Mat im1 = imread(INPUT_FOLDER_PATH + string("WojUi.png"), 0);
Mat im2 = imread(INPUT_FOLDER_PATH + string("XbrsV.png"), 0);
// get the orientation vector
Mat v1 = orientVec(im1);
Mat v2 = orientVec(im2);
double angle = acos(v1.dot(v2))*180/CV_PI;
// rotate im2. try rotating with -angle and +angle. here using -angle
Mat rot = getRotationMatrix2D(Point(im2.cols/2, im2.rows/2), -angle, 1.0);
Mat im2Rot;
warpAffine(im2, im2Rot, rot, Size(im2.rows, im2.cols));
// add a border to rotated image
int borderSize = im1.rows > im2.cols ? im1.rows/2 + 1 : im1.cols/2 + 1;
Mat im2RotBorder;
copyMakeBorder(im2Rot, im2RotBorder, borderSize, borderSize, borderSize, borderSize,
BORDER_CONSTANT, Scalar(0, 0, 0));
// normalized cross-correlation
Mat& image = im2RotBorder;
Mat& templ = im1;
Mat nxcor;
matchTemplate(image, templ, nxcor, CV_TM_CCOEFF_NORMED);
// take the max
double max;
Point maxPt;
minMaxLoc(nxcor, NULL, &max, NULL, &maxPt);
// draw the match
Mat rgb;
cvtColor(image, rgb, CV_GRAY2BGR);
rectangle(rgb, maxPt, Point(maxPt.x+templ.cols-1, maxPt.y+templ.rows-1), Scalar(0, 255, 255), 2);
cout << "max: " << max << endl;
With -angle rotation in code, I get max = 0.758. Below is the rotated image in this case with the matching region.
Otherwise max = 0.293

Cumulative Homography Wrongly Scaling

I'm trying to build a panorama image of the ground covered by a downward-facing camera (at a fixed height, around 1 metre above the ground). This could potentially run to thousands of frames, so the Stitcher class's built-in panorama method isn't really suitable; it's far too slow and memory hungry.
Instead I'm assuming the floor and the motion are planar (not unreasonable here) and trying to build up a cumulative homography as I see each frame. That is, for each frame, I calculate the homography from the previous one to the new one. I then get the cumulative homography by multiplying that with the product of all previous homographies.
Let's say I get H01 between frames 0 and 1, then H12 between frames 1 and 2. To get the transformation to place frame 2 onto the mosaic, I need to get H01*H12. This continues as the frame count increases, such that I get H01*H12*H23*H34*H45*....
In code, this is something akin to:
cv::Mat previous, current;

// Init cumulative homography
cv::Mat cumulative_homography = cv::Mat::eye(3, 3, CV_64F);

video_stream >> previous;
for (;;) {
    video_stream >> current;

    // Here I do some checking of the frame, etc

    // Get the homography using my DenseMosaic class (using Farneback to get OF)
    cv::Mat tmp_H = DenseMosaic::get_homography(previous, current);

    // Now normalise the homography by its bottom right corner
    tmp_H /= tmp_H.at<double>(2, 2);

    cumulative_homography *= tmp_H;

    previous = current.clone();
}
It works pretty well, except that as the camera moves "up" in the viewpoint, the homography scale decreases. As it moves down, the scale increases again. This gives my panoramas a perspective type effect that I really don't want.
For example, this is taken on a few seconds of video moving forward then backward. The first frame looks ok:
The problem comes as we move forward a few frames:
Then when we come back again, you can see the frame gets bigger again:
I'm at a loss as to where this is coming from.
I'm using Farneback dense optical flow to calculate pixel-pixel correspondences as below (sparse feature matching doesn't work well on this data) and I've checked my flow vectors - they're generally very good, so it's not a tracking problem. I also tried switching the order of the inputs to find homography (in case I'd mixed up the frame numbers), still no better.
cv::calcOpticalFlowFarneback(grey_1, grey_2, flow_mat, 0.5, 6,50, 5, 7, 1.5, flags);
// Using the flow_mat optical flow map, populate grid point correspondences between images
std::vector<cv::Point2f> points_1, points_2;
median_motion = DenseMosaic::dense_flow_to_corresp(flow_mat, points_1, points_2);
cv::Mat H = cv::findHomography(cv::Mat(points_2), cv::Mat(points_1), CV_RANSAC, 1);
Another thing I thought it could be was the translation I include in the transformation to ensure my panorama is centred within the scene:
cv::warpPerspective(init.clone(), warped, translation*homography, init.size());
But having checked the values in the homography before the translation is applied, the scaling issue I mention is still present.
Any hints are gratefully received. There's a lot of code I could put in, but it seems irrelevant; please do let me know if something is missing.
UPDATE
I've tried switching out the *= operator for the full multiplication and tried reversing the order the homographies are multiplied in, but no luck. Below is my code for calculating the homography:
/**
\brief Calculates the homography between the current and previous frames
*/
cv::Mat DenseMosaic::get_homography()
{
    cv::Mat grey_1, grey_2; // Grayscale versions of frames

    cv::cvtColor(prev, grey_1, CV_BGR2GRAY);
    cv::cvtColor(cur, grey_2, CV_BGR2GRAY);

    // Calculate the dense flow
    int flags = cv::OPTFLOW_FARNEBACK_GAUSSIAN;
    if (frame_number > 2) {
        flags = flags | cv::OPTFLOW_USE_INITIAL_FLOW;
    }
    cv::calcOpticalFlowFarneback(grey_1, grey_2, flow_mat, 0.5, 6, 50, 5, 7, 1.5, flags);

    // Convert the flow map to point correspondences
    std::vector<cv::Point2f> points_1, points_2;
    median_motion = DenseMosaic::dense_flow_to_corresp(flow_mat, points_1, points_2);

    // Use the correspondences to get the homography
    cv::Mat H = cv::findHomography(cv::Mat(points_2), cv::Mat(points_1), CV_RANSAC, 1);
    return H;
}
And this is the function I use to find the correspondences from the flow map:
/**
\brief Calculate pixel->pixel correspondences given a map of the optical flow across the image
\param[in] flow_mat Map of the optical flow across the image
\param[out] points_1 The set of points from #cur
\param[out] points_2 The set of points from #prev
\param[in] step_size The size of spaces between the grid lines
\return The median motion as a point

Uses a dense flow map (such as that created by cv::calcOpticalFlowFarneback) to obtain a set of point correspondences across a grid.
*/
cv::Point2f DenseMosaic::dense_flow_to_corresp(const cv::Mat &flow_mat, std::vector<cv::Point2f> &points_1, std::vector<cv::Point2f> &points_2, int step_size)
{
    std::vector<double> tx, ty;
    for (int y = 0; y < flow_mat.rows; y += step_size) {
        for (int x = 0; x < flow_mat.cols; x += step_size) {
            /* Flow is basically the delta between left and right points */
            cv::Point2f flow = flow_mat.at<cv::Point2f>(y, x);
            tx.push_back(flow.x);
            ty.push_back(flow.y);

            /* There's no need to calculate for every single point,
               if there's not much change, just ignore it
             */
            if (fabs(flow.x) < 0.1 && fabs(flow.y) < 0.1)
                continue;

            points_1.push_back(cv::Point2f(x, y));
            points_2.push_back(cv::Point2f(x + flow.x, y + flow.y));
        }
    }

    // I know this should be median, not mean, but it's only used for plotting the
    // general motion direction so it's unimportant.
    cv::Point2f t_median;
    cv::Scalar mtx = cv::mean(tx);
    t_median.x = mtx[0];
    cv::Scalar mty = cv::mean(ty);
    t_median.y = mty[0];

    return t_median;
}
It turns out this was because my viewpoint was close to the features, meaning that the non-planarity of the tracked features was skewing the homography. I managed to prevent this (it's more of a hack than a method...) by using estimateRigidTransform instead of findHomography, as this does not estimate perspective variations.
In this particular case, it makes sense to do so, as the view does only ever undergo rigid transformations.
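A possible sketch of that substitution, reusing the same point vectors as in get_homography above (note that estimateRigidTransform is deprecated in OpenCV 4.x in favour of estimateAffinePartial2D); the 2x3 result is lifted to a 3x3 matrix so it can still be chained into the cumulative product:

#include <opencv2/opencv.hpp>
#include <vector>

// Estimate a rigid (non-perspective) transform between the point sets and
// return it as a 3x3 matrix compatible with the cumulative homography chain.
cv::Mat get_rigid_as_homography(const std::vector<cv::Point2f>& points_2,
                                const std::vector<cv::Point2f>& points_1)
{
    // false = translation + rotation + uniform scale only (no shear/perspective).
    cv::Mat rigid = cv::estimateRigidTransform(points_2, points_1, false);

    cv::Mat H = cv::Mat::eye(3, 3, CV_64F);
    if (!rigid.empty())
        rigid.copyTo(H(cv::Rect(0, 0, 3, 2)));  // copy the 2x3 block into the top of H
    return H;
}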

Matching thermographic / non-thermographic images with OpenCV feature detectors

I’m currently working on building software which can match infrared and non-infrared images taken from a fixed point using a thermographic camera.
The use case is the following: a picture of a fixed point is taken from a tripod with both an infrared thermographic camera and a standard camera. After taking the pictures, the photographer wants to match the images from each camera. There will be some scenarios where an image is taken with only one camera, as the other image type is unnecessary. Yes, it may be possible for the images to be matched using timestamps, but the end user demands they be matched using computer vision.
I've looked at other image matching posts on StackOverflow -- they have often focused on using histogram matching and feature detectors. Histogram matching is not an option here, as we cannot match colors between the two image types. As a result, I've developed an application which does feature detection. In addition to standard feature detection, I’ve also added some logic which says that two keypoints cannot be matching if they are not within a certain margin of each other (a keypoint on the far left of the query image cannot match a keypoint on the far right of the candidate image) -- this process occurs in stage 3 of the code below.
To give you an idea of the current output, here is a valid and invalid match produced -- note the thermographic image is on the left. My objective is to improve the accuracy of the matching process.
Valid match:
Invalid match:
Here is the code:
// for each candidate image specified on the command line, compare it against the query image
Mat img1 = imread(argv[1], CV_LOAD_IMAGE_GRAYSCALE); // loading query image
for (int candidateImage = 0; candidateImage < (argc - 2); candidateImage++) {
    Mat img2 = imread(argv[candidateImage + 2], CV_LOAD_IMAGE_GRAYSCALE); // loading candidate image
    if (img1.empty() || img2.empty())
    {
        printf("Can't read one of the images\n");
        return -1;
    }

    // detecting keypoints
    SiftFeatureDetector detector;
    vector<KeyPoint> keypoints1, keypoints2;
    detector.detect(img1, keypoints1);
    detector.detect(img2, keypoints2);

    // computing descriptors
    SiftDescriptorExtractor extractor;
    Mat descriptors1, descriptors2;
    extractor.compute(img1, keypoints1, descriptors1);
    extractor.compute(img2, keypoints2, descriptors2);

    // matching descriptors
    BFMatcher matcher(NORM_L1);
    vector< vector<DMatch> > matches_stage1;
    matcher.knnMatch(descriptors1, descriptors2, matches_stage1, 2);

    // use nndr to eliminate weak matches
    float nndrRatio = 0.80f;
    vector< DMatch > matches_stage2;
    for (size_t i = 0; i < matches_stage1.size(); ++i)
    {
        if (matches_stage1[i].size() < 2)
            continue;
        const DMatch &m1 = matches_stage1[i][0];
        const DMatch &m2 = matches_stage1[i][1];
        if (m1.distance <= nndrRatio * m2.distance)
            matches_stage2.push_back(m1);
    }

    // eliminate points which are too far away from each other
    vector<DMatch> matches_stage3;
    for (int i = 0; i < matches_stage2.size(); i++) {
        Point queryPt = keypoints1.at(matches_stage2.at(i).queryIdx).pt;
        Point trainPt = keypoints2.at(matches_stage2.at(i).trainIdx).pt;

        // determine the lowest number here
        int lowestXAxis;
        int greaterXAxis;
        if (queryPt.x <= trainPt.x) { lowestXAxis = queryPt.x; greaterXAxis = trainPt.x; }
        else { lowestXAxis = trainPt.x; greaterXAxis = queryPt.x; }

        int lowestYAxis;
        int greaterYAxis;
        if (queryPt.y <= trainPt.y) { lowestYAxis = queryPt.y; greaterYAxis = trainPt.y; }
        else { lowestYAxis = trainPt.y; greaterYAxis = queryPt.y; }

        // determine if these points are acceptable
        bool acceptable = true;
        if ((lowestXAxis + MARGIN) < greaterXAxis) { acceptable = false; }
        if ((lowestYAxis + MARGIN) < greaterYAxis) { acceptable = false; }
        if (acceptable == false) { continue; }

        //// it's acceptable -- provide details, perform input
        matches_stage3.push_back(matches_stage2.at(i));
    }

    // output how many individual matches were found for this training image
    cout << "good matches found for candidate image # " << (candidateImage + 1) << " = " << matches_stage3.size() << endl;
I used this site's code as an example. The problem I'm having is that the feature detection is not reliable, and I seem to be missing the purpose of the NNDR ratio. I understand that I am finding the K best possible matches for each point within the query image and that I have K = 2. But I don't understand the purpose of this part of the example code:
vector< DMatch > matches_stage2;
for (size_t i = 0; i < matches_stage1.size(); ++i)
{
    if (matches_stage1[i].size() < 2)
        continue;
    const DMatch &m1 = matches_stage1[i][0];
    const DMatch &m2 = matches_stage1[i][1];
    if (m1.distance <= nndrRatio * m2.distance)
        matches_stage2.push_back(m1);
}
Any ideas on how I can improve this further? Any advice would be appreciated as always.
The validation you currently use
First stage
First of all, let's talk about the part of the code that you don't understand. The idea is to keep only "strong matches". Actually, your call to knnMatch finds, for each descriptor, the best two correspondences with respect to the Euclidean distance "L2" (*). This does not mean at all that these are good matches in reality, but only that those feature points are quite similar.
Let me try to explain your validation now, considering only one feature point in image A (it generalizes to all of them):
You match the descriptor of this point against image B
You get the two best correspondences with respect to the Euclidean distance (i.e. you get the two most similar points in image B)
If the distance from your point to the best correspondence is much smaller than the one from your point to the second-best correspondence, then you assume that it is a good match. In other words, there was only one point in image B that was really similar (i.e. had a small Euclidean distance) to the point in image A.
If both matches are too similar (i.e. !(m1.distance <= nndrRatio * m2.distance)), then you cannot really discriminate between them and you don't consider the match.
This validation has some major weaknesses, as you have probably observed:
Firstly, if the best matches you get from knnMatch are both terribly bad, then the best of those might be accepted anyway.
It does not take the geometry into account. Therefore, a point far on the left of the image might be similar to a point far on the right, even though in reality they clearly don't match.
* EDIT: Using SIFT, you describe each feature point in your image using a floating-point vector. By computing the Euclidean distance between two vectors, you know how similar they are. If both vectors are exactly the same, then the distance is zero. The smaller the distance, the more similar the points. But this is not geometric: a point on the left-hand side of your image might look similar to a point in the right-hand side. So you first find the pairs of points that look similar (i.e. "This point in A looks similar to this point in B because the Euclidean distance between their feature vectors is small") and then you need to verify that this match is coherent (i.e. "It is possible that those similar points are actually the same because they both are on the left-hand side of my image" or "They look similar, but that is incoherent because I know that they must lie on the same side of the image and they don't").
Second stage
What you do in your second stage is interesting since it considers the geometry: knowing that both images were taken from the same point (or almost the same point?), you eliminate matches that are not in the same region in both images.
The problem I see with this is that if both images weren't taken at the exact same position with the very same angle, then it won't work.
Proposition to improve your validation further
I would personally work on the second stage. Even though both images aren't necessarily exactly the same, they describe the same scene. And you can take advantage of the geometry of it.
The idea is that you should be able to find a transformation from the first image to the second one (i.e. the way in which a point moves from image A to image B is actually linked to the way all of the points move). And in your situation, I would bet that a simple homography is well suited.
Here is my proposition:
Compute the matches using knnMatch, and keep stage 1 (you might want to try removing it later and observe the consequences)
Compute the best possible homography transform between those matches using cv::findHomography (choose the RANSAC algorithm).
findHomography has a mask output that will give you the inliers (i.e. the matches that were used to compute the homography transform).
The inliers will most probably be good matches, since they will be geometrically coherent.
EDIT: I just found an example using findHomography here.
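As a sketch of steps 2 to 4, reusing the variable names from your code (the reprojection threshold of 3.0 is an assumption), the margin check of stage 3 could be replaced by the RANSAC inlier mask:

vector<Point2f> queryPts, trainPts;
for (size_t i = 0; i < matches_stage2.size(); ++i) {
    queryPts.push_back(keypoints1[matches_stage2[i].queryIdx].pt);
    trainPts.push_back(keypoints2[matches_stage2[i].trainIdx].pt);
}

vector<DMatch> matches_stage3;
if (queryPts.size() >= 4) {
    vector<uchar> inlierMask;
    Mat H = findHomography(queryPts, trainPts, CV_RANSAC, 3.0, inlierMask);
    // keep only the matches that RANSAC considered geometrically consistent
    for (size_t i = 0; i < matches_stage2.size(); ++i)
        if (inlierMask[i])
            matches_stage3.push_back(matches_stage2[i]);
}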
Haven't tried it with infrared/visible light photography, but mutual information metrics usually do a reasonable job when you have very different histograms for similar images.
Depending on how fast you need this to be and how many candidates there are, one way to leverage this would be to register the images using a mutual information metric and find the image pair where you end up with the lowest error. It would probably be a good idea to downsample the images to speed things up and reduce noise-sensitivity.
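If you want to experiment with this, here is a rough sketch of a mutual-information score between two same-sized 8-bit grayscale images, built from a joint histogram (the 32-bin choice is arbitrary); after registering a candidate you would pick the pair with the highest score, i.e. the lowest error:

#include <opencv2/opencv.hpp>
#include <cmath>

// Mutual information between two aligned grayscale images of the same size.
double mutual_information(const cv::Mat& a, const cv::Mat& b, int bins = 32)
{
    CV_Assert(a.size() == b.size() && a.type() == CV_8UC1 && b.type() == CV_8UC1);

    // Joint intensity histogram, normalized to a probability distribution.
    cv::Mat joint = cv::Mat::zeros(bins, bins, CV_64F);
    for (int y = 0; y < a.rows; ++y)
        for (int x = 0; x < a.cols; ++x)
            joint.at<double>(a.at<uchar>(y, x) * bins / 256,
                             b.at<uchar>(y, x) * bins / 256) += 1.0;
    joint /= static_cast<double>(a.total());

    // Marginal distributions.
    cv::Mat pa, pb;
    cv::reduce(joint, pa, 1, cv::REDUCE_SUM);  // row sums
    cv::reduce(joint, pb, 0, cv::REDUCE_SUM);  // column sums

    double mi = 0.0;
    for (int i = 0; i < bins; ++i)
        for (int j = 0; j < bins; ++j) {
            double pxy = joint.at<double>(i, j);
            if (pxy > 0)
                mi += pxy * std::log(pxy / (pa.at<double>(i) * pb.at<double>(j)));
        }
    return mi;  // higher = images are more statistically dependent
}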
After extracting keypoints, forming descriptors and matching, use some outlier removal algorithm like RANSAC. OpenCV provides RANSAC through the findHomography function; you can see the implementation there. I have used this with SURF and it gave me reasonably good results.
Ideas:
a) use the super-resolution module to improve your input (OpenCV 2.4.5).
b) use maximally stable local color regions as matching features (MSER).