ransac with homography vs 8/5 point algorithms - computer-vision

I'm beginning to learn computer vision and I'm confused about the difference between the two.
I know that the 8-point algorithm is used to compute the fundamental matrix and the 5-point algorithm is used to compute the essential matrix. Both can be used to determine the relative camera pose.
I also found that the relative camera pose can be determined using RANSAC with a homography (https://inspirit.github.io/jsfeat/#multiview, in the ransac method).
Is there a difference between using RANSAC with a homography as opposed to using those algorithms?

First of all, note that you still need RANSAC with the 8-point or 5-point algorithm, since in practice outliers are to be expected in the matching process.
I think the main downside of pose from homography is that the point matches you use need to be coplanar. Additionally, if I'm not mistaken, in a scene with more than one plane you might get different homographies depending on which plane you select in the scene. That is why applying a homography to correct perspective adds distortion to some other parts of the image (see the example in this video). So in complex scenes (e.g. urban environments) where matching is more difficult, I'd use the 8-point or the 5-point algorithm.
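For concreteness, here is a minimal sketch of the pose-from-homography route (my illustration, not part of the original answer; it assumes OpenCV 3+, coplanar matched points and a known intrinsic matrix K):

#include <opencv2/calib3d.hpp>
#include <vector>

void poseFromHomography(const std::vector<cv::Point2f>& pts1,
                        const std::vector<cv::Point2f>& pts2,
                        const cv::Mat& K)
{
    cv::Mat inlierMask;
    cv::Mat H = cv::findHomography(pts1, pts2, cv::RANSAC, 3.0, inlierMask);

    // Decompose H into candidate rotations/translations (up to four solutions).
    std::vector<cv::Mat> rotations, translations, normals;
    cv::decomposeHomographyMat(H, K, rotations, translations, normals);
    // The physically valid solution still has to be selected, e.g. the one that
    // gives positive depth for the inlier points.
}

Note that the decomposition is ambiguous and the candidates must be disambiguated with additional constraints, which is one more reason the 8/5-point route is often preferred in general scenes.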
Note that you can also recover the relative pose directly (up to scale, obviously) and compute the essential matrix from that (see this paper). It's easier than computing the fundamental/essential matrix and then extracting the relative pose.

Related

OpenCV triangulatePoints varying distance

I am using OpenCV's triangulatePoints function to determine 3D coordinates of a point imaged by a stereo camera.
I am experiencing that this function gives me a different distance to the same point depending on the angle of the camera to that point.
Here is a video:
https://www.youtube.com/watch?v=FrYBhLJGiE4
In this video, we are tracking the 'X' mark. In the upper left corner, info is displayed about the point that is being tracked. (YouTube dropped the quality; the video is normally much sharper, (2x1280) x 720.)
In the video, the left camera is the origin of the 3D coordinate system and it is looking in the positive Z direction. The left camera is undergoing some translation, but not nearly as much as the triangulatePoints function leads me to believe. (More info is in the video description.)
The metric unit is mm, so the point is initially triangulated at ~1.94 m distance from the left camera.
I am aware that insufficiently precise calibration can cause this behaviour. I have run three independent calibrations using a chessboard pattern. The resulting parameters vary too much for my taste (approx. ±10% for the focal length estimation).
As you can see, the video is not highly distorted. Straight lines appear pretty straight everywhere. So the optimum camera parameters must be close to the ones I am already using.
My question is, is there anything else that can cause this?
Can a convergence angle between the two stereo cameras have this effect? Or a wrong baseline length?
Of course, there is always the matter of errors in feature detection. Since I am using optical flow to track the 'X' mark, I get subpixel precision, which can be off by... I don't know... ±0.2 px?
I am using the Stereolabs ZED stereo camera. I am not accessing the video frames directly with OpenCV; instead, I have to use the special SDK I acquired when purchasing the camera. It has occurred to me that this SDK might be doing some undistortion of its own.
So now I wonder... if the SDK undistorts an image using incorrect distortion coefficients, can that create an image that is neither barrel-distorted nor pincushion-distorted but something different altogether?
The SDK provided with the ZED camera performs undistortion and rectification of the images. The geometry model is the same as OpenCV's:
intrinsic parameters and distortion parameters for both the Left and Right cameras.
extrinsic parameters for the rotation/translation between Right and Left.
Through one of the ZED tools (the ZED Settings App), you can enter your own intrinsic matrix and distortion coefficients for Left/Right, as well as the Baseline/Convergence.
To get a precise 3D triangulation, you may need to adjust those parameters, since they have a high impact on the disparity you estimate before converting it to depth.
OpenCV provides a good module to calibrate 3D cameras. It does:
- Mono calibration (calibrateCamera) for Left and Right, followed by a stereo calibration (cv::stereoCalibrate()). It will output the intrinsic parameters (focal length, optical center (very important)) and the extrinsics (Baseline = T[0], Convergence = R[1] if R is a 3x1 vector). The RMS error (the return value of stereoCalibrate()) is a good way to check whether the calibration was done correctly.
The important thing is that you need to do this calibration on raw images, not on images provided by the ZED SDK. Since the ZED is a standard UVC camera, you can use OpenCV to grab the side-by-side raw images (cv::VideoCapture with the correct device number) and extract the Left and Right native images, as sketched below.
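Here is a minimal sketch of that raw capture step (my illustration, not from the ZED documentation; the device index 0 and the 2x1280x720 resolution are assumptions, and the cv::CAP_PROP_* names are the OpenCV 3 style constants):

#include <opencv2/opencv.hpp>

int main()
{
    cv::VideoCapture cap(0);                      // ZED exposed as a standard UVC device
    cap.set(cv::CAP_PROP_FRAME_WIDTH, 2560);      // 2 x 1280, side by side
    cap.set(cv::CAP_PROP_FRAME_HEIGHT, 720);

    cv::Mat sideBySide;
    while (cap.read(sideBySide))
    {
        int w = sideBySide.cols / 2;
        cv::Mat left  = sideBySide(cv::Rect(0, 0, w, sideBySide.rows));
        cv::Mat right = sideBySide(cv::Rect(w, 0, w, sideBySide.rows));
        // feed left/right into cv::calibrateCamera / cv::stereoCalibrate ...
        cv::imshow("left", left);
        cv::imshow("right", right);
        if (cv::waitKey(1) == 27) break;          // Esc to quit
    }
    return 0;
}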
You can then enter those calibration parameters in the tool. The ZED SDK will then perform the undistortion/rectification and provide the corrected images. The new camera matrix is provided by getParameters(). You need to use those values when you triangulate, since the images are corrected as if they were taken by this "ideal" camera.
hope this helps.
/OB/
There are 3 points I can think of that will probably help you.
1. Probably the least important: from your description, you have separately calibrated the cameras and then the stereo system. Running an overall optimization should improve the reconstruction accuracy, since some "less accurate" parameters can compensate for other "less accurate" parameters.
2. If the accuracy of the reconstruction is important to you, you need a systematic approach to reducing the error. Building an uncertainty model is easy thanks to the mathematical model, and you can write a few lines of code to build it for you. Say you want to see whether a 3D point 2 meters away, at a particular angle to the camera system, and with a specific uncertainty on its 2D projections, can be located reliably: it is easy to back-project that uncertainty into the 3D space around your 3D point. By then adding uncertainty to the other parameters of the system, you can see which ones matter most and need to have lower uncertainty.
3. Some inaccuracy is inherent in the problem and the method you're using.
First, if you model the uncertainty, you will see that reconstructed 3D points further away from the camera centers have a much higher uncertainty. The reason is that the angle <left-camera, 3d-point, right-camera> is narrower. I remember the MVG book has a good description of this, with a figure.
Second, if you look at the implementation of triangulatePoints, you see that the pseudo-inverse method is implemented using SVD to construct the 3D point. That can lead to many issues, which you probably remember from linear algebra.
Update:
But I consistently get larger distance near edges and several times the magnitude of the uncertainty caused by the angle.
That's the result of using the pseudo-inverse, a purely numerical method. You can replace it with a geometrical method. One easy approach is to back-project the 2D projections to get two rays in 3D space. Then you want to find where they intersect, which in practice never happens exactly because of the inaccuracies. Instead, you find the point where the two rays come closest to each other. Without considering the uncertainty, the pseudo-inverse consistently favors one point from the set of feasible solutions; that's why with the pseudo-inverse you don't see random fluctuation but a gross, consistent error.
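A minimal sketch of that geometric "midpoint" triangulation (my illustration; it assumes R and t map camera-1 coordinates into camera-2 coordinates and that K1, K2 are the intrinsics):

#include <opencv2/core.hpp>

cv::Point3d triangulateMidpoint(const cv::Point2d& x1, const cv::Point2d& x2,
                                const cv::Matx33d& K1, const cv::Matx33d& K2,
                                const cv::Matx33d& R,  const cv::Vec3d& t)
{
    // Back-project each observation to a ray direction (normalized image coordinates).
    cv::Vec3d d1 = cv::normalize(K1.inv() * cv::Vec3d(x1.x, x1.y, 1.0));
    cv::Vec3d d2 = cv::normalize(K2.inv() * cv::Vec3d(x2.x, x2.y, 1.0));

    // Express the second ray in camera-1 coordinates.
    cv::Vec3d o2  = -1.0 * (R.t() * t);    // camera-2 center in the camera-1 frame
    cv::Vec3d d2w = R.t() * d2;            // camera-2 ray direction in the camera-1 frame

    // Closest points p = s*d1 on ray 1 and q = o2 + u*d2w on ray 2.
    double a = d1.dot(d1), b = d1.dot(d2w), c = d2w.dot(d2w);
    double d = d1.dot(o2), e = d2w.dot(o2);
    double denom = a * c - b * b;          // ~0 when the rays are (nearly) parallel
    double s = (c * d - b * e) / denom;
    double u = (b * d - a * e) / denom;

    cv::Vec3d p = s * d1;
    cv::Vec3d q = o2 + u * d2w;
    return cv::Point3d(0.5 * (p + q));     // midpoint of the segment of closest approach
}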
Regarding the overall optimization: yes, you can run an iterative LM optimization over all the parameters. This is the method used in applications like SLAM for autonomous vehicles, where accuracy is very important. You can find some papers by googling "bundle adjustment SLAM".

Camera pose estimation

I am trying to write a program from scratch that can estimate the pose of a camera. I am open to any programming language and using inbuilt functions/methods for feature detection...
I have been exploring different ways of estimating pose, like SLAM, PTAM, DTAM etc., but I don't really need tracking and mapping, I just need the pose.
Can any of you suggest an approach or any resource that can help me? I know what a pose is and have a rough idea of how to estimate it, but I am unable to find any resources that explain how it can be done.
I was thinking of starting with a recorded video, extracting features from it, and then using those features and geometry to estimate the pose.
(Please forgive my naivety, I am not a computer vision person and am fairly new to all of this)
In order to compute a camera pose, you need a reference frame that is given by some known points in the image.
These known points come, for example, from a calibration pattern, but they can also be known landmarks in your images (for example, the 4 corners of the base of the Gizeh pyramids).
The problem of estimating the pose of the camera given known landmarks seen by the camera (i.e. finding the camera's 3D position and orientation from 2D image points of known 3D landmarks) is classically known as PnP.
OpenCV provides a ready-made solver for this problem.
However, you first need to calibrate your camera, i.e. you need to determine what makes it unique.
The parameters you need to estimate are called intrinsic parameters, because they depend on the camera's focal length, sensor size, etc., but not on the camera's location or orientation.
These parameters mathematically describe how world points are projected onto your camera's sensor frame.
You can estimate them from known planar patterns (again, OpenCV has ready-made functions for that).
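A minimal sketch of that PnP step (my illustration; it assumes you already have the 3D landmark coordinates, their 2D detections, and the intrinsics K plus distortion coefficients from calibration):

#include <opencv2/calib3d.hpp>
#include <vector>

void estimatePose(const std::vector<cv::Point3f>& worldPoints,   // known 3D landmarks
                  const std::vector<cv::Point2f>& imagePoints,   // their 2D detections
                  const cv::Mat& K, const cv::Mat& distCoeffs)
{
    cv::Mat rvec, tvec;                    // rotation (Rodrigues vector) and translation
    cv::solvePnP(worldPoints, imagePoints, K, distCoeffs, rvec, tvec);

    cv::Mat R;
    cv::Rodrigues(rvec, R);                // 3x3 rotation matrix
    // The camera center in world coordinates is -R^T * tvec.
    cv::Mat camCenter = -R.t() * tvec;
}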
Generally, you can extract the pose of a camera only relative to a given reference frame.
It is quite common to estimate the relative pose between one view of a camera to another view.
The most general relationship between two views of the same scene taken by two different cameras is given by the fundamental matrix (google it).
You can calculate the fundamental matrix from correspondences between the images. For example, look at the Matlab implementation:
http://www.mathworks.com/help/vision/ref/estimatefundamentalmatrix.html
After calculating this, you can use a decomposition of the fundamental matrix to get the relative pose between the cameras. (Look here for an example: http://www.daesik80.com/matlabfns/function/DecompPMatQR.m).
You can follow a similar procedure if you have a calibrated camera; then you need the essential matrix instead of the fundamental matrix.
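For the calibrated case, a minimal OpenCV sketch (my illustration, assuming OpenCV 3+, matched point pairs and a known intrinsic matrix K):

#include <opencv2/calib3d.hpp>
#include <vector>

void relativePose(const std::vector<cv::Point2f>& pts1,
                  const std::vector<cv::Point2f>& pts2,
                  const cv::Mat& K)
{
    cv::Mat inlierMask;
    cv::Mat E = cv::findEssentialMat(pts1, pts2, K, cv::RANSAC, 0.999, 1.0, inlierMask);

    cv::Mat R, t;                          // pose of the second view relative to the first
    cv::recoverPose(E, pts1, pts2, K, R, t, inlierMask);
    // t is only defined up to scale; R is a proper 3x3 rotation matrix.
}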

OpenCV findFundamentalMat very unstable and sensitive

I'm working on a project for my university where we want a quadrocopter to stabilize itself with its camera. Unfortunately, the fundamental matrix reacts very sensitively to small changes in the feature points; I'll give you examples later on.
I think my matching already works pretty well thanks to OpenCV.
I'm using SURF features and match them with the knn method:
SurfFeatureDetector surf_detect;
surf_detect = SurfFeatureDetector(400);

// detect keypoints
surf_detect.detect(fr_one.img, fr_one.kp);
surf_detect.detect(fr_two.img, fr_two.kp);

// extract descriptors
SurfDescriptorExtractor surf_extract;
surf_extract.compute(fr_one.img, fr_one.kp, fr_one.descriptors);
surf_extract.compute(fr_two.img, fr_two.kp, fr_two.descriptors);

// match keypoints in both directions
vector<vector<DMatch> > matches1, matches2;
vector<DMatch> symMatches, goodMatches;
FlannBasedMatcher flann_match;
flann_match.knnMatch(fr_one.descriptors, fr_two.descriptors, matches1, 2);
flann_match.knnMatch(fr_two.descriptors, fr_one.descriptors, matches2, 2);

// keep only matches found in both directions
symmetryTest(matches1, matches2, symMatches);

std::vector<cv::Point2f> points1, points2;
for (std::vector<cv::DMatch>::const_iterator it = symMatches.begin();
     it != symMatches.end(); ++it)
{
    // left keypoints
    float x = fr_one.kp[it->queryIdx].pt.x;
    float y = fr_one.kp[it->queryIdx].pt.y;
    points1.push_back(cv::Point2f(x, y));
    // right keypoints
    x = fr_two.kp[it->trainIdx].pt.x;
    y = fr_two.kp[it->trainIdx].pt.y;
    points2.push_back(cv::Point2f(x, y));
}

// kill outliers with RANSAC
vector<uchar> inliers(points1.size(), 0);
findFundamentalMat(Mat(points1), Mat(points2),
                   inliers, CV_FM_RANSAC, 3.f, 0.99f);

// keep only the inlier matches
std::vector<uchar>::const_iterator itIn = inliers.begin();
std::vector<cv::DMatch>::const_iterator itM = symMatches.begin();
for (; itIn != inliers.end(); ++itIn, ++itM)
{
    if (*itIn)
    {
        goodMatches.push_back(*itM);
    }
}
Now I want to compute the fundamental matrix from these matches. I'm using the 8-point method for this example; I already tried it with LMEDS and RANSAC, and there it only gets worse because there are more matches that change.
vector<int> pointIndexes1;
vector<int> pointIndexes2;
for (vector<DMatch>::const_iterator it = goodMatches.begin();
     it != goodMatches.end(); ++it)
{
    pointIndexes1.push_back(it->queryIdx);
    pointIndexes2.push_back(it->trainIdx);
}

vector<Point2f> selPoints1, selPoints2;
KeyPoint::convert(fr_one.kp, selPoints1, pointIndexes1);
KeyPoint::convert(fr_two.kp, selPoints2, pointIndexes2);

Mat F = findFundamentalMat(Mat(selPoints1), Mat(selPoints2), CV_FM_8POINT);
When I run these calculations in a loop on the same pair of images, the result for F varies very much; there's no way to extract motion from such calculations.
I generated an example where I filtered out some matches so that you can see the effect I mentioned for yourselves.
http://abload.de/img/div_c_01ascel.png
http://abload.de/img/div_c_02zpflj.png
Is there something wrong with my code, or do I have to think about other reasons, like image quality and so on?
Thanks in advance for the help!
derfreak
To summarize what others have already stated, and to elaborate in more detail:
As currently implemented in OpenCV, the 8-point algorithm has no outlier rejection. It is a least-squares algorithm and cannot be used with RANSAC or LMEDS because these flags override the 8-point flag. It is recommended that the input points are normalized to improve the condition number of the matrix in the linear equation, as stated in "In Defence of the 8-point Algorithm". However, the OpenCV implementation automatically normalizes the input points, so there is no need to normalize them manually.
The 5-point and 7-point algorithms both have outlier rejection, using RANSAC or LMEDS. If you are using RANSAC, you may need to tune the threshold to get good results. The OpenCV documentation shows that the default threshold for RANSAC is 1.0, which in my opinion is a bit large. I might recommend using something around 0.1 pixels. On the other hand, if you are using LMEDS you won't need to worry about the threshold, because LMEDS minimizes the median error instead of counting inliers. LMEDS and RANSAC both have similar accuracy if the correct threshold is used and both have comparable computation time.
The 5-point algorithm is more robust than the 7-point algorithm because it only has 5 degrees of freedom (3 for rotation and 2 for the unit-vector translation) instead of 7 (the additional 2 parameters are for the camera principal point). This minimal parameterization allows the rotation and translation to be easily extracted from the matrix using SVD and avoids the planar structure degeneracy problem.
However, in order to get accurate results with the 5-point algorithm, the focal length must be known. The paper suggests that the focal length should be known within 10%, otherwise the 5-point algorithm is no better than the other uncalibrated algorithms. If you haven't performed camera calibration before, check out the OpenCV camera calibration tutorial. Also, if you are using ROS, there is a nice camera calibration package.
When using the OpenCV findEssentialMat function I recommend first passing the pixel points to undistortPoints. This not only reverses the effect of lens distortion, but also transforms the coordinates to normalized image coordinates. Normalized image coordinates (not to be confused with the normalization done in the 8-point algorithm) are camera agnostic coordinates that do not depend on any of the camera intrinsic parameters. They represent the angle of the bearing vector to the point in the real world. For example, a normalized image coordinate of (1, 0) would correspond to a bearing angle of 45 degrees from the optical axis of the camera in the x direction and 0 degrees in the y direction.
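A minimal sketch of that workflow (my illustration; K and distCoeffs are assumed to come from a prior calibration, and the RANSAC threshold is converted from pixels to normalized coordinates by dividing by the focal length):

#include <opencv2/calib3d.hpp>
#include <vector>

cv::Mat essentialFromPixels(const std::vector<cv::Point2f>& px1,
                            const std::vector<cv::Point2f>& px2,
                            const cv::Mat& K, const cv::Mat& distCoeffs)
{
    std::vector<cv::Point2f> n1, n2;
    cv::undistortPoints(px1, n1, K, distCoeffs);   // removes distortion, outputs normalized coordinates
    cv::undistortPoints(px2, n2, K, distCoeffs);

    // In normalized coordinates the camera is the identity: focal = 1, principal point = (0, 0).
    // A 0.1 px threshold becomes 0.1 / f in normalized units (K assumed to be CV_64F).
    double f = K.at<double>(0, 0);
    return cv::findEssentialMat(n1, n2, 1.0, cv::Point2d(0, 0),
                                cv::RANSAC, 0.999, 0.1 / f);
}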
After using RANSAC to obtain a good hypothesis, the best estimate can be improved by using iterative robust non-linear least-squares. This is mentioned in the paper and described in more detail in "Bundle Adjustment - A Modern Synthesis". Unfortunately, it appears that the OpenCV implementation of the 5-point algorithm does not use any iterative refinement methods.
Even if your algorithm is correct, 8-point F matrix computation is very error prone due to image noise. The fewer correspondences the solver needs, the better. The best you can do is 5-point essential (E) matrix computation, but that requires you to pre-calibrate the camera and convert the detected pixel image points after SIFT/SURF to normalized image points (metric point locations). Then apply Nister's 5-point algorithm, either from the freely available Matlab implementation or from Bundler (a C++ implementation by Noah Snavely). In my experience with SfM, the 5-point E matrix is much more stable than 7- or 8-point F matrix computation. And of course, do RANSAC after the 5-point step to get more robust estimates. Hope this helps.
The 8-point algorithm is the simplest method for computing the fundamental matrix, but if care is taken you can make it perform well. The key to obtaining good results is proper, careful normalization of the input data before constructing the equations to solve. Many algorithms can do it.
Pixel point coordinates must be changed to camera coordinates, and I don't see you doing that. As I understand it, your vector<int> pointIndexes1 is expressed in pixel coordinates.
You must know the intrinsic camera parameters if you want more stable results. You may find them by many methods: see the OpenCV calibration tutorial. Then you have two options for normalization. You may apply it to your fundamental matrix:
Mat E = K.t() * F * K; where K is the intrinsic camera matrix [see the Wikipedia article].
However, this approach is not the most accurate. If the camera calibration matrix K is known, then you may apply its inverse to each point x to obtain the point expressed in normalized camera coordinates:
pointNormalized1 = K.inv() * pointIndexes1, where the point is in homogeneous coordinates (the third component, z, is equal to 1).
In the case of the 8-point algorithm, a simple transformation of the points improves the conditioning and hence the stability of the results. The suggested normalization is a translation and scaling of each image so that the centroid of the reference points is at the origin of the coordinates and the RMS distance of the points from the origin is equal to sqrt(2). Note that it is recommended that the singularity constraint be enforced before denormalization.
Reference: check it if you are still interested.
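A minimal sketch of that normalization (my illustration of the Hartley scheme; the function name is hypothetical). The returned 3x3 transforms T1, T2 for the two images can be used to denormalize the estimated matrix afterwards via F = T2^T * Fn * T1:

#include <opencv2/core.hpp>
#include <cmath>
#include <vector>

// Returns the 3x3 normalizing transform; 'out' receives the transformed points.
cv::Matx33d normalizePoints(const std::vector<cv::Point2f>& in,
                            std::vector<cv::Point2f>& out)
{
    // Centroid of the points.
    double cx = 0.0, cy = 0.0;
    for (const auto& p : in) { cx += p.x; cy += p.y; }
    cx /= in.size();
    cy /= in.size();

    // RMS distance of the centered points from the origin.
    double meanSq = 0.0;
    for (const auto& p : in)
    {
        double dx = p.x - cx, dy = p.y - cy;
        meanSq += dx * dx + dy * dy;
    }
    double rms = std::sqrt(meanSq / in.size());
    double s = std::sqrt(2.0) / rms;           // scale so the RMS distance becomes sqrt(2)

    out.clear();
    for (const auto& p : in)
        out.emplace_back(static_cast<float>(s * (p.x - cx)),
                         static_cast<float>(s * (p.y - cy)));

    // Maps homogeneous pixel coordinates to the normalized coordinates.
    return cv::Matx33d(s, 0, -s * cx,
                       0, s, -s * cy,
                       0, 0, 1);
}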

displacement between two images using opencv surf

I am working on image processing with OpenCV.
I want to find the x, y and rotational displacement between two images in OpenCV.
I have found the features of the images using SURF and the features have been matched.
Now I want to find the displacement between the images. How do I do that? Can RANSAC be useful here?
regards,
shiksha
A rotation and two translations are three unknowns, so your minimum number of matches is two (since each match delivers two equations, or constraints). Indeed, imagine a line segment between two points in one image and the corresponding (matched) line segment in the other image. The difference between the segments' orientations gives you the rotation angle. After you have rotated, just use any of the matched points to find the translation. Thus this is a 3-DOF problem that requires two points. It is called a Euclidean transformation, rigid-body transformation, or orthogonal Procrustes problem.
Using a homography (which is an 8-DOF problem) that has no closed-form solution and relies on non-linear optimization is a bad idea. It is slow (in the RANSAC case) and inaccurate, since it adds 5 extra DOF. RANSAC is only needed if you have outliers. In the case of pure noise and an overdetermined system (more than 2 points), the optimal solution that minimizes the sum of squared geometric distances between matched points is given in closed form by:
Problem statement: minimize sum_i ||R*P_i + t - Q_i||^2 over the rotation R and translation t.
Solution: R = V*U^T, t = Qmean - R*Pmean,
where X = P - Pmean, Y = Q - Qmean, and the SVD gives X*Y^T = U*S*V^T; all matrices have the data points as columns. For a gentle intro into rigid transformations, see this.
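A minimal sketch of that closed-form fit in 2D (my illustration; it assumes the matches are already outlier-free or RANSAC-filtered, and adds the usual determinant check against a reflection):

#include <opencv2/core.hpp>
#include <vector>

void rigidFit2D(const std::vector<cv::Point2d>& P,
                const std::vector<cv::Point2d>& Q,
                cv::Mat& R, cv::Mat& t)
{
    const int n = static_cast<int>(P.size());
    cv::Point2d pMean(0, 0), qMean(0, 0);
    for (int i = 0; i < n; ++i) { pMean += P[i]; qMean += Q[i]; }
    pMean *= 1.0 / n;
    qMean *= 1.0 / n;

    // Cross-covariance H = X * Y^T, with the centered data points as columns.
    cv::Mat H = cv::Mat::zeros(2, 2, CV_64F);
    for (int i = 0; i < n; ++i)
    {
        cv::Mat x = (cv::Mat_<double>(2, 1) << P[i].x - pMean.x, P[i].y - pMean.y);
        cv::Mat y = (cv::Mat_<double>(2, 1) << Q[i].x - qMean.x, Q[i].y - qMean.y);
        H += x * y.t();
    }

    cv::Mat w, U, Vt;
    cv::SVD::compute(H, w, U, Vt);
    R = Vt.t() * U.t();                          // R = V * U^T
    if (cv::determinant(R) < 0)                  // guard against a reflection
    {
        cv::Mat D = cv::Mat::eye(2, 2, CV_64F);
        D.at<double>(1, 1) = -1.0;
        R = Vt.t() * D * U.t();
    }

    cv::Mat pM = (cv::Mat_<double>(2, 1) << pMean.x, pMean.y);
    cv::Mat qM = (cv::Mat_<double>(2, 1) << qMean.x, qMean.y);
    t = qM - R * pM;                             // t = Qmean - R * Pmean
}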

Rigid motion estimation

Now what I have are the 3D point sets as well as the projection parameters of the camera. Given two 2D point sets projected from the 3D points using the camera and the transformed camera (by rotation and translation), there should be an intuitive way to estimate the camera motion... I read some parts of Zisserman's book "Multiple View Geometry in Computer Vision", but I still did not get the solution.
Are there any hints on how the rigid motion can be estimated in this case?
THANKS!!
What you are looking for is a solution to the PnP problem. OpenCV has a function that should work, called solvePnP. Just to be clear: for this to work you need the point locations in world space, a camera matrix, and the points' projections onto the image plane. It will then tell you the rotation and translation of the camera or of the points, depending on how you choose to think about it.
Adding to the previous answer: Eigen has an implementation of Umeyama's method for estimating the rigid transformation between two sets of 3D points. You can use it to get an initial estimate and then refine it with an optimization algorithm that also considers the projections of the 3D points onto the images. For example, you could try to minimize the reprojection error between the 2D points in the first image and the projections of the 3D points after you bring them from the reference frame of one camera to the reference frame of the other using the previously estimated transformation. You can do this both ways, using the transformation and its inverse, and try to minimize the bidirectional reprojection error. I'd recommend the paper "Stereo visual odometry for autonomous ground robots" by Andrew Howard, as well as some of its references, for a better explanation, especially if you are considering an outlier removal/inlier detection step before the actual motion estimation.
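A minimal sketch of the Eigen route (my illustration; it assumes the two 3D point sets are stored as 3xN matrices with the points as columns):

#include <Eigen/Dense>
#include <Eigen/Geometry>

Eigen::Matrix4d rigidFromPointSets(const Eigen::Matrix3Xd& src,
                                   const Eigen::Matrix3Xd& dst)
{
    // with_scaling = false gives a pure rotation + translation (rigid motion).
    Eigen::Matrix4d T = Eigen::umeyama(src, dst, false);
    // T.topLeftCorner<3,3>() is the rotation, T.topRightCorner<3,1>() the translation;
    // use it as the initial estimate for the reprojection-error refinement described above.
    return T;
}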