Image Stitching details with OpenCV - c++

I am trying to get deep into stitching. I am using cv::detail.
I am trying to follow this example:
I roughly understand the stitching pipeline.
There is a function matchesGraphAsString() which returns a graph. I am wondering how it computes this graph. Further, what is the definition of the confidence interval in this case?
The output is in DOT format and a sample graph looks like
graph matches_graph{
"15.jpg" -- "13.jpg"[label="Nm=75, Ni=50, C=1.63934"];
"15.jpg" -- "12.jpg"[label="Nm=47, Ni=28, C=1.26697"];
"15.jpg" -- "14.jpg"[label="Nm=149, Ni=117, C=2.22011"];
"11.jpg" -- "13.jpg"[label="Nm=71, Ni=52, C=1.77474"];
"11.jpg" -- "9.jpg"[label="Nm=46, Ni=37, C=1.69725"];
"11.jpg" -- "10.jpg"[label="Nm=87, Ni=73, C=2.14076"];
"9.jpg" -- "8.jpg"[label="Nm=122, Ni=99, C=2.21973"];
}
What do label, Nm, and Ni mean here? The official documentation seems to lack these details.

This is a very interesting question indeed. As @hatboyzero pointed out, the meaning of the variables is reasonably straightforward:
Nm is the number of matches (in the overlapping region, so obvious outliers have been removed already).
Ni is the number of inliers after finding a homography with RANSAC.
C is the confidence that the two images are a match.
Background to matching
Building a panorama starts by finding interest points in all images and computing descriptors for them. These descriptors, like SIFT, SURF and ORB, were developed so that the same part of a scene can be recognized in different images. Each one is just a medium-dimensional vector (64 or 128 dimensions are typical). By computing the L2 or some other distance between two descriptors, matches can be found. The number of matches found in a pair of images is Nm.
Notice that so far, the matching has only been done through appearance of image regions around interest points. Very typically, many of these matches are plain wrong. This can be because the descriptor looks the same (think: repetitive object like window sills on a multi-window building, or leaves on a tree) or because the descriptor is just a bit too uninformative.
The common solution is to add geometric constraints: The image pair was taken from the same position with the same camera, therefore points that are close in one image must be close in the other image, too. More specifically, all the points must have undergone the same transformation. In the panorama case where the camera was rotated around the nodal point of the camera-lens system this transformation must have been a 2D homography.
RANSAC is the gold standard algorithm to find the best transformation and all the matches that are consistent with this transformation. The number of these consistent matches is called Ni. RANSAC works by randomly selecting, in this case, 4 matches (see paper sect 3.1) and fitting a homography to these four matches. Then it counts how many of all the matches agree with this homography. This is repeated 500 times (see paper), and at the end the model with the most inliers is kept and re-computed from all of its inliers. The name of the algorithm comes from RANdom SAmple Consensus: RANSAC.
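To make the loop above concrete, here is a minimal RANSAC sketch in plain C++. To stay short it fits a 2D translation (one match per sample) instead of the homography used by the stitcher (four matches per sample), so it only illustrates the sample / count / keep-best structure and is not what OpenCV does. The names Match, RansacResult and ransacTranslation, the 3-pixel tolerance and the fixed seed are assumptions made for this example; only the 500 iterations come from the text.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

struct Point { double x, y; };
struct Match { Point a, b; };  // a: point in image 1, b: matched point in image 2

struct RansacResult {
    double dx = 0.0, dy = 0.0;    // best translation found
    std::size_t num_inliers = 0;  // plays the role of Ni
};

// True if the match agrees with the translation (dx, dy) within tol pixels.
static bool agrees(const Match& m, double dx, double dy, double tol) {
    return std::hypot(m.a.x + dx - m.b.x, m.a.y + dy - m.b.y) <= tol;
}

RansacResult ransacTranslation(const std::vector<Match>& matches,
                               int iterations = 500, double tol = 3.0,
                               unsigned seed = 42) {
    RansacResult best;
    if (matches.empty()) return best;
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, matches.size() - 1);
    for (int i = 0; i < iterations; ++i) {
        // Minimal sample: one match fixes a translation (a homography needs four).
        const Match& s = matches[pick(rng)];
        const double dx = s.b.x - s.a.x, dy = s.b.y - s.a.y;
        std::size_t n = 0;
        for (const Match& m : matches) n += agrees(m, dx, dy, tol) ? 1 : 0;
        if (n > best.num_inliers) { best.dx = dx; best.dy = dy; best.num_inliers = n; }
    }
    // Re-compute the model from all inliers of the best sample (mean offset).
    double sx = 0.0, sy = 0.0;
    std::size_t n = 0;
    for (const Match& m : matches) {
        if (agrees(m, best.dx, best.dy, tol)) {
            sx += m.b.x - m.a.x;
            sy += m.b.y - m.a.y;
            ++n;
        }
    }
    if (n > 0) { best.dx = sx / n; best.dy = sy / n; }
    return best;
}
```

In the notation of the graph, matches.size() corresponds to Nm and the returned num_inliers to Ni.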
Confidence-Term
What remained unclear to me was this mysterious confidence. I quickly found where it is calculated.
From stitching/sources/matches.cpp:
// These coeffs are from paper M. Brown and D. Lowe. "Automatic Panoramic Image Stitching
// using Invariant Features"
matches_info.confidence = matches_info.num_inliers / (8 + 0.3 * matches_info.matches.size());
// Set zero confidence to remove matches between too close images, as they don't provide
// additional information anyway. The threshold was set experimentally.
matches_info.confidence = matches_info.confidence > 3. ? 0. : matches_info.confidence;
The mentioned paper has, in section 3.2 ("Probabilistic Model for Image Match Verification"), some more details on what this means.
Reading this section a few things stood out.
There are a lot of variables (mostly probabilities) in their model. These values are defined in the paper without any justification. Below is the key sentence:
Though in practice we have chosen values for p0, p1, p(m = 0), p(m = 1) and pmin, they could in principle be learnt from the data.
So, this is just a theoretical exercise, as the parameters have been plucked out of thin air. Notice the "could in principle be learnt".
The paper has in equation 13 the confidence calculation. If read correctly, it means that matches_info.confidence indicates a proper match between two images iff its value is above 1.
I don't see any justification for removing a match (setting confidence to 0) when the confidence is above 3. It just means that there are very few outliers. I think the programmers assumed that a high number of matches that turn out to be inliers means that the images overlap a great deal, but this isn't guaranteed by the algorithms behind this. (Simply put, the matches are based on the appearance of features.)
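As a sanity check, the two quoted lines can be wrapped in a small function and run against the graph from the question. This is only a sketch: matchConfidence is a made-up name, not an OpenCV function.

```cpp
#include <cassert>
#include <cstddef>

// Mirrors the two quoted source lines. num_matches is Nm, num_inliers is Ni.
double matchConfidence(std::size_t num_matches, std::size_t num_inliers) {
    assert(num_inliers <= num_matches);  // inliers are a subset of the matches
    const double c = num_inliers / (8.0 + 0.3 * num_matches);
    // Same experimental cutoff as in the quoted source: values above 3 are zeroed.
    return c > 3.0 ? 0.0 : c;
}
```

With Nm=75 and Ni=50 this gives 50 / (8 + 0.3 * 75) = 50 / 30.5 ≈ 1.63934, which is the C value on the first edge of the graph. Since Ni can never exceed Nm, the raw value stays below 1 / 0.3 ≈ 3.33, so the cutoff at 3 only triggers when almost all of a large number of matches are inliers.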

Glancing at the OpenCV source code available online, I gather that they mean the following:
Nm - Number of pairwise matches
Ni - Number of geometrically consistent matches
C - Confidence two images are from the same panorama
I'm basing my assumptions on a snippet from the body of matchesGraphAsString in modules/stitching/src/motion_estimators.cpp from version 2.4.2 of the OpenCV source code, namely:
str << "\"" << name_src << "\" -- \"" << name_dst << "\""
<< "[label=\"Nm=" << pairwise_matches[pos].matches.size()
<< ", Ni=" << pairwise_matches[pos].num_inliers
<< ", C=" << pairwise_matches[pos].confidence << "\"];\n";
Additionally, I'm also looking at the documentation for detail::MatchesInfo for information about the Ni and C terms.

Related

Error in calculating exact nearest neighbors in radius with FLANN

I am trying to find the exact number of neighbour nodes in a big 3D point dataset. The goal is, for each point of the dataset, to retrieve all the possible neighbours in a region with a given radius. FLANN is supposed to retrieve the exact neighbors for low-dimensional data, but a comparison with brute-force search suggests this is not the case. The neighbors are essential for further calculations, and therefore I need the exact number. I tried increasing the radius a little bit, but that doesn't seem to be the problem. Is anyone aware how to calculate the exact neighbors with FLANN or another C++ library?
The code:
// All nodes to be tested for inclusion in support domain.
flann::Matrix<double> query_nodes = flann::Matrix<double>(&nodes_pos[0].x, nodes_pos.size(), 3);
// Set default search parameters
flann::SearchParams search_parameters = flann::SearchParams();
search_parameters.checks = -1;
search_parameters.sorted = false;
search_parameters.use_heap = flann::FLANN_True;
flann::KDTreeSingleIndexParams index_parameters = flann::KDTreeSingleIndexParams();
flann::KDTreeSingleIndex<flann::L2_3D<double> > index(query_nodes, index_parameters);
index.buildIndex();
// FLANN's L2 distances are squared, so the search radius is squared as well.
double l2_radius = (this->support_layer_*grid.spacing)*(this->support_layer_*grid.spacing);
double extension = l2_radius/10.;
l2_radius+= extension;
index.radiusSearch(query_nodes, indices, dists, l2_radius, search_parameters);
Try nanoflann. It is designed for low dimensional spaces and gives exact nearest neighbors. Furthermore, it is just one header file that you can either "install" or just copy to your project.
You should check page 6+ from the flann-manual, to fine-tune your search parameters, such as target_precision, which should be set to 1, for "maximum" accuracy.
That parameter is often found as epsilon (ε) in Approximate Nearest Neighbor Search (ANNS), which is used in high dimensional spaces, in order to (try) to beat the curse of dimensionality. FLANN is usually used in 128 dimensions, not 3, as far as I can tell, which may explain the bad performance you are experiencing.
A C++ library that works well in 3 dimensions is CGAL. However, it is much larger than FLANN because it is a library for computational geometry, so it provides functionality for many problems, not just NNS.
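Whichever library is used, its output can be checked against a brute-force reference like the one mentioned in the question. The following is a minimal sketch in plain C++ and uses no FLANN calls; Vec3 and bruteForceRadius are hypothetical names, and treating points exactly on the boundary as neighbors (<=) is an assumption that may differ from what a given library does.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// For every point, lists the indices of all points (itself included) whose
// squared distance is at most radius * radius. O(n^2): a reference, not a solution.
std::vector<std::vector<std::size_t>> bruteForceRadius(const std::vector<Vec3>& pts,
                                                       double radius) {
    const double r2 = radius * radius;  // compare squared distances, no sqrt needed
    std::vector<std::vector<std::size_t>> out(pts.size());
    for (std::size_t i = 0; i < pts.size(); ++i) {
        for (std::size_t j = 0; j < pts.size(); ++j) {
            const double dx = pts[i].x - pts[j].x;
            const double dy = pts[i].y - pts[j].y;
            const double dz = pts[i].z - pts[j].z;
            if (dx * dx + dy * dy + dz * dz <= r2) out[i].push_back(j);
        }
    }
    return out;
}
```

Comparing the per-point neighbor counts of this function with the sizes of the index lists returned by the library shows quickly whether the mismatch is systematic or limited to points that sit right at the radius.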

Backpropagation 2-Dimensional Neuron Network C++

I am learning about Two Dimensional Neuron Network so I am facing many obstacles but I believe it is worth it and I am really enjoying this learning process.
Here's my plan: To make a 2-D NN work on recognizing images of digits. Images are 5 by 3 grids and I prepared 10 images from zero to nine. For Example this would be number 7:
Number 7 has indexes 0,1,2,5,8,11,14 as 1s (or 3,4,6,7,9,10,12,13 as 0s doesn't matter) and so on. Therefore, my input layer will be a 5 by 3 neuron layer and I will be feeding it zeros OR ones only (not in between and the indexes depends on which image I am feeding the layer).
My output layer however will be one dimensional layer of 10 neurons. Depends on which digit was recognized, a certain neuron will fire a value of one and the rest should be zeros (shouldn't fire).
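The encoding described above could be sketched as follows. This assumes the 5 by 3 grid is stored row by row (index = row * 3 + column); makeInput and makeTarget are hypothetical helper names, and the configurable "off" value is an addition for experimenting with target ranges.

```cpp
#include <array>
#include <cstddef>
#include <initializer_list>

constexpr std::size_t kRows = 5, kCols = 3, kDigits = 10;

// Builds the 15-element input vector from the indices of the cells that are 1.
std::array<double, kRows * kCols> makeInput(std::initializer_list<std::size_t> on) {
    std::array<double, kRows * kCols> in{};  // value-initialized: all zeros
    for (std::size_t i : on) in.at(i) = 1.0;
    return in;
}

// Builds the 10-element target: 1 for the digit shown, `off` everywhere else.
std::array<double, kDigits> makeTarget(std::size_t digit, double off = 0.0) {
    std::array<double, kDigits> t;
    t.fill(off);
    t.at(digit) = 1.0;
    return t;
}
```

For example, makeInput({0, 1, 2, 5, 8, 11, 14}) read row by row gives the rows 1 1 1, 0 0 1, 0 0 1, 0 0 1, 0 0 1, which is the digit 7, and makeTarget(7) sets only output neuron 7 to one.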
I am done implementing everything, but I have a problem with the computation and would really appreciate any help. I am getting an extremely high error rate and extremely low (negative) output values on all output neurons, and the values (error and output) do not change even on the 10,000th pass.
I would love to go further and post my backpropagation methods, since I believe the problem is there. However, to break down my work, I would love to hear some comments first: I want to know whether my design is reasonable.
Does my plan make sense?
All the posts speak about ranges (0 -> 1, -1 -> +1, 0.01 -> 0.5, etc.). Will it work for outputs that are either 0 or 1 on the output layer, and not a range? If yes, how can I control that?
I am using the hyperbolic tangent as my transfer function. Does it make a difference whether I use this, sigmoid, or other functions?
Any ideas/comments/guidance are appreciated and thanks in advance
Well, from the description given above, I think that the design and the approach taken are correct! With respect to the choice of activation function, remember that those functions help to pick out the neurons with the largest activation. Their algebraic properties, such as an easy derivative, also help with the definition of backpropagation. Taking this into account, you should not worry about your choice of activation function.
The ranges that you mention correspond to scaling the input: it is better to have your input images in the range 0 to 1. This helps to scale the error surface and helps with the speed and convergence of the optimization process. Because your input set is composed of images, and each image is composed of pixels, the minimum and maximum values that a pixel can attain are 0 and 255, respectively. To scale your input in this example, it is essential to divide each value by 255.
Now, with respect to the training problems: have you tried checking whether your gradient calculation routine is correct, i.e., by evaluating the cost function J? If not, try generating a toy vector theta that contains all the weight matrices involved in your neural network, and evaluate the gradient at each point using the definition of the gradient. Sorry for the Matlab example, but it should be easy to port to C++:
perturb = zeros(size(theta));
e = 1e-4;
for p = 1:numel(theta)
    % Set perturbation vector
    perturb(p) = e;
    loss1 = J(theta - perturb);
    loss2 = J(theta + perturb);
    % Compute Numerical Gradient
    numgrad(p) = (loss2 - loss1) / (2*e);
    perturb(p) = 0;
end
After evaluating the function, compare the numerical gradient with the gradient calculated by backpropagation. If the difference between the two is less than 3e-9, then your implementation is most likely correct.
I recommend checking out the UFLDL tutorials offered by the Stanford Artificial Intelligence Laboratory. There you can find a lot of information related to neural networks and their paradigms; they are worth a look!
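A possible C++ port of the Matlab loop above is sketched below. It assumes all weights have been flattened into one std::vector<double> and that the cost J is available as a callable; numericalGradient and maxAbsDiff are hypothetical names, not part of any library.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Cost = std::function<double(const std::vector<double>&)>;

// Central-difference estimate of the gradient of J at theta.
std::vector<double> numericalGradient(const Cost& J, std::vector<double> theta,
                                      double e = 1e-4) {
    std::vector<double> grad(theta.size());
    for (std::size_t p = 0; p < theta.size(); ++p) {
        const double saved = theta[p];
        theta[p] = saved - e;
        const double loss1 = J(theta);
        theta[p] = saved + e;
        const double loss2 = J(theta);
        grad[p] = (loss2 - loss1) / (2.0 * e);
        theta[p] = saved;  // restore before perturbing the next weight
    }
    return grad;
}

// Largest absolute element-wise difference between two gradients.
double maxAbsDiff(const std::vector<double>& a, const std::vector<double>& b) {
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    return worst;
}
```

Calling maxAbsDiff(numericalGradient(J, theta), backpropGradient) on a small network, where backpropGradient stands for whatever your own backpropagation routine returns, gives the number to compare against the threshold.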
http://ufldl.stanford.edu/wiki/index.php/Main_Page
http://ufldl.stanford.edu/tutorial/

OpenCV: findHomography generating an empty matrix

When using findHomography():
Mat H = findHomography( obj, scene, cv::RANSAC , 3, hom_mask, 2000, 0.995 );
Sometimes, for some image, the resulting H matrix stays empty (H is a UINT8, 1x0x0). However, there is clearly a match between both images (and it looks like good keypoint matches are detected), and just a moment before, with two similar images with similar keypoint responses, a relevant matrix was generated. Input parameters "obj" and "scene" are both a vector of Point2f containing various coordinates.
Is this a common issue? Or do you think a bug might lurk somewhere? Personally, I have processed hundreds of images where a match exists, and while I have sometimes seen poor matches, this is the first time I have gotten an empty matrix...
EDIT: That said, even if my eyes think that there should be a match in the image pairs, I realize that the matcher might confuse some portion of the image with another one, and that maybe there is indeed no "good" match.
So my question would be: How does findHomography() behave when it is unable to find a suitable Homography? Does it return an empty matrix or will it always give a homography, albeit a very poor one? I just want to know if I encounter standard behaviour or if there is a bug in my own code.
Well, you see, the cv::findHomography() function can return an empty homography matrix (0 cols x 0 rows), starting approximately from the 2.4.5 release.
According to some reports, this seems to happen only when the cv::RANSAC flag is passed.
See the issue reported here:
It likely happened because we put in new experimental version of
Levenberg-Marquardt solver, which does not work that well (maybe due
to some bugs)
I suggest to check the computed homography before using it anywhere:
cv::Mat h = cv::findHomography(...);
if (!h.empty())
{
// Use it
}

Parameter of BackgroundSubtractorMOG2

I have a problem understanding all the parameters of BackgroundSubtractorMOG2.
I looked in the code (located in bgfg_gaussmix2.cpp), but don't see the connection to the mentioned paper. For example, Tb = varThreshold, but what is the name of Tb in the paper?
I am especially interested in the parameters marked in bold.
Let's start with the easy parameter [my remarks]:
int nmixtures
Maximum allowed number of mixture components. Actual number is determined dynamically per pixel.
[set 0 for GMG]
uchar nShadowDetection
The value for marking shadow pixels in the output foreground mask. Default value is 127.
float fTau
Shadow threshold. The shadow is detected if the pixel is a darker version of the background. Tau is a threshold defining how much darker the shadow can be. Tau= 0.5 means that if a pixel is more than twice darker then it is not shadow.
Now to the ones I don't understand:
float backgroundRatio
Threshold defining whether the component is significant enough to be included into the background model ( corresponds to TB=1-cf from the paper??which paper??). cf=0.1 => TB=0.9 is default. For alpha=0.001, it means that the mode should exist for approximately 105 frames before it is considered foreground.
float varThresholdGen
Threshold for the squared Mahalanobis distance that helps decide when a sample is close to the existing components (corresponds to Tg). If it is not close to any component, a new component is generated. 3 sigma => Tg=3*3=9 is default. A smaller Tg value generates more components. A higher Tg value may result in a small number of components but they can grow too large. [i don't understand a word of this]
In the Constructor the variable varThreshold is used. Is it the same as varThresholdGen?
Threshold on the squared Mahalanobis distance to decide whether it is well described by the background model (see Cthr??). This parameter does not affect the background update. A typical value could be 4 sigma, that is, varThreshold=4*4=16; (see Tb??).
float fVarInit
Initial variance for the newly generated components. It affects the speed of adaptation. The parameter value is based on your estimate of the typical standard deviation from the images. OpenCV uses 15 as a reasonable value.
float fVarMin
Parameter used to further control the variance.
float fVarMax
Parameter used to further control the variance.
float fCT
Complexity reduction parameter. This parameter defines the number of samples needed to accept to prove the component exists. CT=0.05 is a default value for all the samples. By setting CT=0 you get an algorithm very similar to the standard Stauffer&Grimson algorithm.
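Two of the numbers quoted above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions, not the OpenCV implementation: it assumes a component's weight grows as 1 - (1 - alpha)^n after n consecutive matching frames (an exponential update starting from zero), and that each component has a single variance shared by all three channels. framesUntilWeight and squaredMahalanobis are hypothetical helper names.

```cpp
#include <array>
#include <cmath>

// Number of frames n after which 1 - (1 - alpha)^n reaches the weight cf.
double framesUntilWeight(double alpha, double cf) {
    return std::log(1.0 - cf) / std::log(1.0 - alpha);
}

// Squared Mahalanobis distance of a 3-channel pixel to a component with the
// given mean and one variance for all channels (assumed isotropic).
double squaredMahalanobis(const std::array<double, 3>& pixel,
                          const std::array<double, 3>& mean, double variance) {
    double d2 = 0.0;
    for (int c = 0; c < 3; ++c) {
        const double d = pixel[c] - mean[c];
        d2 += d * d;
    }
    return d2 / variance;
}
```

framesUntilWeight(0.001, 0.1) evaluates to about 105.3, in line with the "approximately 105 frames" in the backgroundRatio description. A pixel that differs from a component mean by 10 in one channel, with variance 25, has a squared distance of 100 / 25 = 4, which is below both Tg = 9 and varThreshold = 16. In other words, "3 sigma" and "4 sigma" refer to the distance before squaring.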
Someone asked pretty much the same question on the OpenCV website, but without an answer.
Well, I don't think anyone can tell you which parameter is what if you don't know the details of the algorithm that you are using. Besides, you should not need anyone to tell you which parameter is what if you do know the details of the algorithm. I'm saying this about the detailed parameters (fCT, fVarMax, etc.), not the straightforward ones (nmixtures, nShadowDetection, etc.).
So, I think you should read the papers referenced in the documentation. Here are the links for the papers 1, 2, 3.
You should also read this paper, which is where background estimation began.
After reading these papers and checking the code against them, I'm sure you will understand what those parameters are.
Good luck!

Determine difference in stops between images with no EXIF data

I have a set of images of the same scene but shot with different exposures. These images have no EXIF data so there is no way to extract useful info like f-stop, shutter speed etc.
What I'm trying to do is to determine the difference in stops between the images i.e. Image1 is +1.3 stops of Image0.
My current approach is to first calculate luminance from the image's RGB values using the equation
L = 0.2126 * R + 0.7152 * G + 0.0722 * B
I've seen different numbers being used in the equation but generally it should not affect the end result L too much.
After that I derive the log-average luminance of the image.
exp(avg of log(luminance of image))
But somehow the log-average luminance doesn't seem to give much indication of the exposure difference between the images.
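For reference, the two steps described above (luminance per pixel, then the log-average over the image) could be sketched like this. Rgb and the function names are hypothetical, and the small delta that avoids log(0) on black pixels is an addition that is not part of the description.

```cpp
#include <cmath>
#include <vector>

struct Rgb { double r, g, b; };

// Luminance with the weights quoted above.
double luminance(const Rgb& p) {
    return 0.2126 * p.r + 0.7152 * p.g + 0.0722 * p.b;
}

// exp(mean(log(delta + L))) over all pixels; returns 0 for an empty image.
double logAverageLuminance(const std::vector<Rgb>& pixels, double delta = 1e-6) {
    if (pixels.empty()) return 0.0;
    double sum_log = 0.0;
    for (const Rgb& p : pixels) sum_log += std::log(delta + luminance(p));
    return std::exp(sum_log / pixels.size());
}
```

The log-average is a geometric mean, so two gray pixels with luminance 10 and 1000 average to about 100. Taking log2 of the ratio of two such values only corresponds to stops if the pixel values are proportional to the light collected.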
Any ideas on how to determine exposure difference?
Edit: in C/C++.
You have to generally solve two problems:
1. Linearize your image data
(In case it's not obvious what is meant: two times more light collected by your pixel shall result in two times the intensity value in your linearized image.)
Your image input might be (sufficiently) linearized already -> you may skip to part 2. If your content came from a camera and it's a JPEG, then this will most certainly not be the case.
The real 'solution' to this problem is finding the camera response function, which you want to invert and apply to your image data to get linear intensity values. This is by no means a trivial task. The EMoR model is widely used in all sorts of software (Photoshop, PTGui, Photomatix, etc.) to describe camera response functions. Some open source software solving this problem (but using a different model iirc) is PFScalibrate.
That said, you may get away with a simple inverse gamma application. A rough guesstimate for the right gamma value might be found by doing this:
capture an evenly lit, static scene with two exposure times e and e/2
apply a couple of inverse gamma transforms (e.g. for 1.8 to 2.4 in 0.1 steps) on both images
multiply all the short exposure images with 2.0 and subtract them from the respective long exposure images
pick the gamma that leads to the smallest overall difference
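The four steps above could be sketched as follows. It assumes a pure power-law encoding, 8-bit style values on a 0..255 scale, and two pixel arrays in the same order for the exposures e and e/2; linearize and estimateGamma are hypothetical names. Saturated pixels are not filtered out here, which a real implementation would probably want to do.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Undo a power-law encoding: value in [0, 255] -> linear intensity in [0, 1].
double linearize(double value, double gamma) {
    return std::pow(value / 255.0, gamma);
}

// Tries gamma = 1.8, 1.9, ..., 2.4 and returns the one for which the long
// exposure is closest to twice the short exposure after linearization.
double estimateGamma(const std::vector<double>& long_exp,
                     const std::vector<double>& short_exp) {
    double best_gamma = 0.0;
    double best_err = std::numeric_limits<double>::infinity();
    const std::size_t n = long_exp.size() < short_exp.size() ? long_exp.size()
                                                             : short_exp.size();
    for (int step = 0; step <= 6; ++step) {
        const double gamma = 1.8 + 0.1 * step;
        double err = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            err += std::fabs(linearize(long_exp[i], gamma) -
                             2.0 * linearize(short_exp[i], gamma));
        }
        if (err < best_err) { best_err = err; best_gamma = gamma; }
    }
    return best_gamma;
}
```

If the data really was encoded with a gamma inside the tested range, the summed difference drops to nearly zero at that value and is clearly larger at its neighbors.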
2. Find the actual difference of irradiation in stops, i.e. log2(scale factor)
Presuming the scene was static (no moving objects or camera), this is relatively easy:
sum1 = sum2 = 0
foreach pixel pair (p1,p2) from the two images:
    if p1 or p2 is close to 0 or 255:
        skip this pair
    sum1 += p1
    sum2 += p2
return log2(sum1 / sum2)
On large images this will certainly work just as well and a lot faster if you sub-sample the images.
If the camera was static but the scene was not (moving objects), this starts to work less well. I produced acceptable results in this case by simply repeating the above procedure several times, using the output of the previous run as an estimate for the correct scale factor and discarding pixel pairs whose quotient is too far away from the current estimate. So basically replacing the above if line with the following:
if <see above> or if abs(log2(p1/p2) - estimate) > 0.5:
I'd stop the repetition after a fixed number of iterations or if two consecutive estimates are sufficiently close to each other.
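Put together, the summation and the repeated outlier rejection could look like the sketch below. It assumes already linearized values on a 0..255 scale stored in the same pixel order; the cutoffs 5 and 250 for "close to 0 or 255", the limit of 10 iterations and the convergence threshold are assumptions, and stopsPass / estimateStops are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// One pass of the summation. With use_estimate set, pairs whose log2 ratio is
// more than 0.5 stops away from the estimate are skipped as well.
double stopsPass(const std::vector<double>& img1, const std::vector<double>& img2,
                 bool use_estimate, double estimate) {
    const double lo = 5.0, hi = 250.0;  // assumed cutoffs near 0 and 255
    double sum1 = 0.0, sum2 = 0.0;
    const std::size_t n = img1.size() < img2.size() ? img1.size() : img2.size();
    for (std::size_t i = 0; i < n; ++i) {
        const double p1 = img1[i], p2 = img2[i];
        if (p1 < lo || p1 > hi || p2 < lo || p2 > hi) continue;
        if (use_estimate && std::fabs(std::log2(p1 / p2) - estimate) > 0.5) continue;
        sum1 += p1;
        sum2 += p2;
    }
    if (sum2 <= 0.0) return std::numeric_limits<double>::quiet_NaN();  // no usable pairs
    return std::log2(sum1 / sum2);
}

// Repeats the pass until two consecutive estimates agree or max_iters is hit.
double estimateStops(const std::vector<double>& img1, const std::vector<double>& img2,
                     int max_iters = 10, double eps = 1e-3) {
    double est = stopsPass(img1, img2, false, 0.0);
    for (int i = 0; i < max_iters && !std::isnan(est); ++i) {
        const double next = stopsPass(img1, img2, true, est);
        if (std::isnan(next)) break;  // everything was rejected: keep last estimate
        const bool converged = std::fabs(next - est) < eps;
        est = next;
        if (converged) break;
    }
    return est;
}
```

A positive result means the first image received more light than the second. With one image exactly twice as bright as the other plus a single pair that changed by a factor of 10 (a moving object), the first pass gives about 1.37 stops and the following passes settle on 1.0.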
EDIT: A note about conversion to luminance
You don't need to do that at all (as Tony D mentioned already) and if you insist, then do it after the linearization step (as Mark Ransom noted). In a perfect setting (static scene, no noise, no de-mosaicing, no quantization) every channel of every pixel would have the same ratio p1/p2 (if neither is saturated). Therefore the relative weighting of the different channels is irrelevant. You may sum over all pixels/channels (weighting R, G and B equally) or maybe only use the green channel.