I am working on a Structure from Motion framework for generating 3D models of moving objects with fixed cameras. To do this, I am following this pipeline:
Obtain the fundamental matrix F through keypoints. As I am still in development and I can not (yet) access the final objects I will model, I am doing this with a small object and manually annotated keypoint pairs between two images:
These are the correspondences used to calculate F. Computing the error of F by x'Fx where x' are the points from the right image and x the points of the left image (in pixel coordinates) gives an error of 0,1196.
Compute the essential matrix E using the intrinsic matrices and the fundamental: E = Kleft'*F*Kright where Kleft' is the inverse intrinsic matrix of the left camera. With the SVD decomposition I create a new E, which has only two singular values equal to 1.
Decompose E to get R and t. To do so, I have used a self made custom version of OpenCV recoverPose() function, that allows for two different camera matrices.
Obtain Pleft as a diagonal matrix and Pright as the construction of R and t.
Here are the significant parts of the code:
F = findFundamentalMat( alignedLeft.points, alignedRight.points, mask, RANSAC);
E = cameraMatrixLeftInv*F*cameraMatrixRight;
SVD::compute(E, w, u, vt);
Mat diag = Mat::zeros(Size(3,3),6);
diag.at<double>(0,0) = 1;
diag.at<double>(1,1) = 1;
E = u*diag*vt;
int good = recoverPoseFromTwoCameras(E,alignedLeft.points,alignedRight.points,intrinsicsLeft.K,intrinsicsRight.K,R,t,mask);
Pleft = Matx34f::eye();
Pright = Matx34f(R.at<double>(0,0), R.at<double>(0,1), R.at<double>(0,2), t.at<double>(0),
R.at<double>(1,0), R.at<double>(1,1), R.at<double>(1,2), t.at<double>(1),
R.at<double>(2,0), R.at<double>(2,1), R.at<double>(2,2), t.at<double>(2));
Then I use viz to visualize the camera poses:
viz::Viz3d myWindow("Results");
viz::WCameraPosition cameraLeft(imgLeft->getIntrinsics().K,imgLeft->getImage());
viz::WCameraPosition cameraRight(imgRight->getIntrinsics().K,imgRight->getImage());
Which I place in the viewer using Pleft and Pright:
myWindow.showWidget("cameraLeft",cameraLeft,Affine3d(Pleft(Range::all(),Range(0,3)),Pleft.col(3)));
myWindow.showWidget("cameraRight",cameraRight,Affine3d(Pright(Range::all(),Range(0,3)),Pright.col(3)));
However, if I do that, the result is inverted. I can't embed more than one link because of low reputation, but the camera1 is where the camera 2 should be and viceversa.
But if I apply the matrices like this:
myWindow.showWidget("cameraLeft",cameraLeft,Affine3d(Pright(Range::all(),Range(0,3)),Pleft.col(3)));
myWindow.showWidget("cameraRight",cameraRight,Affine3d(Pleft(Range::all(),Range(0,3)),Pleft.col(3)));
The result is correct.
What am I missing?
Hope you have worked out! If you have not, you'd better upload more data. For example, the features, the F/E you computed, the camera intrinsic. Then we can test your code and try to find your bug, otherwise, there are many possible for this.
Related
I'm following this tutorial, which uses Features2D + Homography. If I have known camera matrix for each image, how can I optimize the result? I tried some images, but it didn't work well.
//Edit
After reading some materials, I think I should rectify two image first. But the rectification is not perfect, so a vertical line on image 1 correspond a vertical band on image 2 generally. Are there any good algorithms?
I'm not sure if I understand your problem. You want to find corresponding points between the images or you want to improve the correctness of your matches by use of the camera intrinsics?
In principle, in order to use camera geometry for finding matches, you would need the fundamental or essential matrix, depending on wether you know the camera intrinsics (i.e. calibrated camera). That means, you would need an estimate for the relative rotation and translation of the camera. Then, by computing the epipolar lines corresponding to the features found in one image, you would need to search along those lines in the second image to find the best match. However, I think it would be better to simply rely on automatic feature matching. Given the fundamental/essential matrix, you could try your luck with correctMatches, which will move the correspondences such that the reprojection error is minimised.
Tips for better matches
To increase the stability and saliency of automatic matches, it usually pays to
Adjust the parameters of the feature detector
Try different detection algorithms
Perform a ratio test to filter out those keypoints which have a very similar second-best match and are therefore unstable. This is done like this:
Mat descriptors_1, descriptors_2; // obtained from feature detector
BFMatcher matcher;
vector<DMatch> matches;
matcher = BFMatcher(NORM_L2, false); // norm depends on feature detector
vector<vector<DMatch>> match_candidates;
const float ratio = 0.8; // or something
matcher.knnMatch(descriptors_1, descriptors_2, match_candidates, 2);
for (int i = 0; i < match_candidates.size(); i++)
{
if (match_candidates[i][0].distance < ratio * match_candidates[i][1].distance)
matches.push_back(match_candidates[i][0]);
}
A more involved way of filtering would be to compute the reprojection error for each keypoint in the first frame. This means to compute the corresponding epipolar line in the second image and then checking how far its supposed matching point is away from that line. Throwing away those points whose distance exceeds some threshold would remove the matches which are incompatible with the epiploar geometry (which I assume would be known). Computing the error can be done like this (I honestly do not remember where I took this code from and I may have modified it a bit, also the SO editor is buggy when code is inside lists, sorry for the bad formatting):
double computeReprojectionError(vector& imgpts1, vector& imgpts2, Mat& inlier_mask, const Mat& F)
{
double err = 0;
vector lines[2];
int npt = sum(inlier_mask)[0];
// strip outliers so validation is constrained to the correspondences
// which were used to estimate F
vector imgpts1_copy(npt),
imgpts2_copy(npt);
int c = 0;
for (int k = 0; k < inlier_mask.size().height; k++)
{
if (inlier_mask.at(0,k) == 1)
{
imgpts1_copy[c] = imgpts1[k];
imgpts2_copy[c] = imgpts2[k];
c++;
}
}
Mat imgpt[2] = { Mat(imgpts1_copy), Mat(imgpts2_copy) };
computeCorrespondEpilines(imgpt[0], 1, F, lines[0]);
computeCorrespondEpilines(imgpt1, 2, F, lines1);
for(int j = 0; j < npt; j++ )
{
// error is computed as the distance between a point u_l = (x,y) and the epipolar line of its corresponding point u_r in the second image plus the reverse, so errij = d(u_l, F^T * u_r) + d(u_r, F*u_l)
Point2f u_l = imgpts1_copy[j], // for the purpose of this function, we imagine imgpts1 to be the "left" image and imgpts2 the "right" one. Doesn't make a difference
u_r = imgpts2_copy[j];
float a2 = lines1[j][0], // epipolar line
b2 = lines1[j]1,
c2 = lines1[j][2];
float norm_factor2 = sqrt(pow(a2, 2) + pow(b2, 2));
float a1 = lines[0][j][0],
b1 = lines[0][j]1,
c1 = lines[0][j][2];
float norm_factor1 = sqrt(pow(a1, 2) + pow(b1, 2));
double errij =
fabs(u_l.x * a2 + u_l.y * b2 + c2) / norm_factor2 +
fabs(u_r.x * a1 + u_r.y * b1 + c1) / norm_factor1; // distance of (x,y) to line (a,b,c) = ax + by + c / (a^2 + b^2)
err += errij; // at this point, apply threshold and mark bad matches
}
return err / npt;
}
The point is, grab the fundamental matrix, use it to compute epilines for all the points and then compute the distance (the lines are given in a parametric form so you need to do some algebra to get the distance). This is somewhat similar in outcome to what findFundamentalMat with the RANSAC method does. It returns a mask wherein for each match there is either a 1, meaning that it was used to estimate the matrix, or a 0 if it was thrown out. But estimating the fundamental Matrix like this will probably be less accurate than using chessboards.
EDIT: Looks like oarfish beat me to it, but I'll leave this here.
The fundamental matrix (F) defines a mapping from a point in the left image to a line in the right image on which the corresponding point must lie, assuming perfect calibration. This is the epipolar line, i.e. the line though the point in the left image and the two epipoles of the stereo camera pair. For references, see these lecture notes and this chapter of the HZ book.
Given a set of point correspondences in the left and right images: (p_L, p_R), from SURF (or any other feature matcher), and given F, the constraint from epipolar geometry of the stereo pair says that p_R should lie on the epipolar line projected by p_L onto the right image, i.e.
In practice, calibration errors from noise as well as erroneous feature matches lead to a non-zero value.
However, using this idea, you can then perform outlier removal by rejecting those feature matches for which this equation is greater than a certain threshold value, i.e. reject (p_L, p_R) if and only if:
When selecting this threshold, keep in mind that it is the distance in image space of a point from an epipolar line that you are willing to tolerate, which in some sense is your epipolar error tolerance.
Degenerate case: To visually imagine what this means, let us assume that the stereo pair differ only in a pure X-translation. Then the epipolar lines are horizontal. This means that you can connect the feature matched point pairs by a line and reject those pairs whose line slope is not close to zero. The equation above is a generalization of this idea to arbitrary stereo rotation and translation, which is accounted for by the matrix F.
Your specific images: It looks like your feature matches are sparse. I suggest instead to use a dense feature matching approach so that after outlier removal, you are still left with a sufficient number of good-quality matches. I'm not sure which dense feature matcher is already implemented in OpenCV, but I suggest starting here.
Giving your pictures, your are trying to do a stereo matching.
This page will be helpfull. The rectification you want can be done using stereoCalibrate then stereoRectify.
The result (from the doc):
In order to find the Fundamental Matrix, you need correct correspondances, but in order to get good correspondances, you need a good estimate of the fundamental matrix. This might sound like an impossible chicken-and-the-egg-problem, but there is well established methods to do this; RANSAC.
It randomly selects a small set of correspondances, uses those to calculate a fundamental matrix (using the 7 or 8 point algorithm) and then tests how many of the other correspondences that comply with this matrix (using the method described by scribbleink for measuring the distance between point and epipolar line). It keeps testing new combinations of correspondances for a certain number of iterations and selects the one with the most inliers.
This is already implemented in OpenCV as cv::findFundamentalMat (http://docs.opencv.org/2.4/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html#findfundamentalmat). Select the method CV_FM_RANSAC to use ransac to remove bad correspondances. It will output a list of all the inlier correspondances.
The requirement for this is that all the points does not lie on the same plane.
I'm trying to get 3D coordinates of several points in space, but I'm getting odd results from both undistortPoints() and triangulatePoints().
Since both cameras have different resolution, I've calibrated them separately, got RMS errors of 0,34 and 0,43, then used stereoCalibrate() to get more matrices, got an RMS of 0,708, and then used stereoRectify() to get remaining matrices. With that in hand I've started the work on gathered coordinates, but I get weird results.
For example, input is: (935, 262), and the undistortPoints() output is (1228.709125, 342.79841) for one point, while for another it's (934, 176) and (1227.9016, 292.4686) respectively. Which is weird, because both of these points are very close to the middle of the frame, where distortions are the smallest. I didn't expect it to move them by 300 pixels.
When passed to traingulatePoints(), the results get even stranger - I've measured the distance between three points in real life (with a ruler), and calculated the distance between pixels on each picture. Because this time the points were on a pretty flat plane, these two lengths (pixel and real) matched, as in |AB|/|BC| in both cases was around 4/9. However, triangulatePoints() gives me results off the rails, with |AB|/|BC| being 3/2 or 4/2.
This is my code:
double pointsBok[2] = { bokList[j].toFloat()+xBok/2, bokList[j+1].toFloat()+yBok/2 };
cv::Mat imgPointsBokProper = cv::Mat(1,1, CV_64FC2, pointsBok);
double pointsTyl[2] = { tylList[j].toFloat()+xTyl/2, tylList[j+1].toFloat()+yTyl/2 };
//cv::Mat imgPointsTyl = cv::Mat(2,1, CV_64FC1, pointsTyl);
cv::Mat imgPointsTylProper = cv::Mat(1,1, CV_64FC2, pointsTyl);
cv::undistortPoints(imgPointsBokProper, imgPointsBokProper,
intrinsicOne, distCoeffsOne, R1, P1);
cv::undistortPoints(imgPointsTylProper, imgPointsTylProper,
intrinsicTwo, distCoeffsTwo, R2, P2);
cv::triangulatePoints(P1, P2, imgWutBok, imgWutTyl, point4D);
double wResult = point4D.at<double>(3,0);
double realX = point4D.at<double>(0,0)/wResult;
double realY = point4D.at<double>(1,0)/wResult;
double realZ = point4D.at<double>(2,0)/wResult;
The angles between points are kinda sorta good but usually not:
`7,16816 168,389 4,44275` vs `5,85232 170,422 3,72561` (degrees)
`8,44743 166,835 4,71715` vs `12,4064 158,132 9,46158`
`9,34182 165,388 5,26994` vs `19,0785 150,883 10,0389`
I've tried to use undistort() on the entire frame, but got results just as odd. The distance between B and C points should be pretty much unchanged at all times, and yet this is what I get:
7502,42
4876,46
3230,13
2740,67
2239,95
Frame by frame.
Pixel distance (bottom) vs real distance (top) - should be very similar:
Angle:
Also, shouldn't both undistortPoints() and undistort() give the same results (another set of videos here)?
The function cv::undistort does undistortion and reprojection in one go. It performs the following list of operations:
undo camera projection (multiplication with the inverse of the camera matrix)
apply the distortion model to undo the distortion
rotate by the provided Rotation matrix R1/R2
project points to image using the provided Projection matrix P1/P2
If you pass the matrices R1, P1 resp. R2, P2 from cv::stereoCalibrate(), the input points will be undistorted and rectified. Rectification means that the images are transformed in a way such that corresponding points have the same y-coordinate. There is no unique solution for image rectification, as you can apply any translation or scaling to both images, without changing the alignement of corresponding points.
That being said, cv::stereoCalibrate() can shift the center of projection quite a bit (e.g. 300 pixels). If you want pure undistortion you can pass an Identity Matrix (instead of R1) and the original camera Matrix K (instead of P1). This should lead to pixel coordinates similar to the original ones.
I am trying to calculate scale, rotation and translation between two consecutive frames of a video. So basically I matched keypoints and then used opencv function findHomography() to calculate the homography matrix.
homography = findHomography(feature1 , feature2 , CV_RANSAC); //feature1 and feature2 are matched keypoints
My question is: How can I use this matrix to calculate scale, rotation and translation?.
Can anyone provide me the code or explanation as to how to do it?
if you can use opencv 3.0, this decomposition method is available
http://docs.opencv.org/3.0-beta/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html#decomposehomographymat
The right answer is to use homography as it is defined dst = H ⋅ src and explore what it does to small segments around a particular point.
Translation
Given a single point, for translation do
T = dst - (H ⋅ src)
Rotation
Given two points p1 and p2
p1 = H ⋅ p1
p2 = H ⋅ p2
Now just calculate the angle between vectors p1 p2 and p1' p2'.
Scale
You can use the same trick but now just compare the lengths: |p1 p2| and |p1' p2'|.
To be fair, use another segment orthogonal to the first and average the result. You will see that there is no constant scale factor or translation one. They will depend on the src location.
Given Homography matrix H:
|H_00, H_01, H_02|
H = |H_10, H_11, H_12|
|H_20, H_21, H_22|
Assumptions:
H_20 = H_21 = 0 and normalized to H_22 = 1 to obtain 8 DOF.
The translation along x and y axes are directly calculated from H:
tx = H_02
ty = H_12
The 2x2 sub matrix on the top left corner is decomposed to calculate shear, scaling and rotation. An easy and quick decomposition method is explained here.
Note: this method assumes invertible matrix.
Since i had to struggle for a couple of days to create my homography transformation function I'm going to put it here for the benefit of everyone.
Here you can see the main loop where every input position is multiplied by the homography matrix h. Then the result is used to copy the pixel from the original position to the destination position.
for (tempIn[0] = 0; tempIn[0] < stride; tempIn[0]++)
{
for (tempIn[1] = 0; tempIn[1] < rows; tempIn[1]++)
{
double w = h[6] * tempIn[0] + h[7] * tempIn[1] + 1; // very important!
//H_20 = H_21 = 0 and normalized to H_22 = 1 to obtain 8 DOF. <-- this is wrong
tempOut[0] = ((h[0] * tempIn[0]) + (h[1] * tempIn[1]) + h[2])/w;
tempOut[1] =(( h[3] * tempIn[0]) +(h[4] * tempIn[1]) + h[5])/w;
if (tempOut[1] < destSize && tempOut[0] < destSize && tempOut[0] >= 0 && tempOut[1] >= 0)
dest_[destStride * tempOut[1] + tempOut[0]] = src_[stride * tempIn[1] + tempIn[0]];
}
}
After such process an image with some kind of grid will be produced. Some kind of filter is needed to remove the grid. In my code i have used a simple linear filter.
Note: Only the central part of the original image is really required for producing a correct image. Some rows and columns can be safely discarded.
For estimating a tree-dimensional transform and rotation induced by a homography, there exist multiple approaches. One of them provides closed formulas for decomposing the homography, but they are very complex. Also, the solutions are never unique.
Luckily, OpenCV 3 already implements this decomposition (decomposeHomographyMat). Given an homography and a correctly scaled intrinsics matrix, the function provides a set of four possible rotations and translations.
The question seems to be about 2D parameters. Homography matrix captures perspective distortion. If the application does not create much perspective distortion, one can approximate a real world transformation using affine transformation matrix (that uses only scale, rotation, translation and no shearing/flipping). The following link will give an idea about decomposing an affine transformation into different parameters.
https://math.stackexchange.com/questions/612006/decomposing-an-affine-transformation
I am using this legacy code: http://fossies.org/dox/opencv-2.4.8/trifocal_8cpp_source.html
for estimating 3D points from the given corresponding 2D points from 3 different views. The problem I faced is same as stated here: http://opencv-users.1802565.n2.nabble.com/trifocal-tensor-icvComputeProjectMatrices6Points-icvComputeProjectMatricesNPoints-td2423108.html
I could compute Projection matrices successfully using icvComputeProjectMatrices6Points. I used 6 set of corresponding points from 3 views. Results are shown below:
projMatr1 P1 =
[-0.22742541, 0.054754492, 0.30500898, -0.60233182;
-0.14346679, 0.034095913, 0.33134204, -0.59825808;
-4.4949986e-05, 9.9166318e-06, 7.106331e-05, -0.00014547621]
projMatr2 P2 =
[-0.17060626, -0.0076031247, 0.42357284, -0.7917347;
-0.028817834, -0.0015948272, 0.2217239, -0.33850163;
-3.3046148e-05, -1.3680664e-06, 0.0001002633, -0.00019192585]
projMatr3 P3 =
[-0.033748217, 0.099119112, -0.4576003, 0.75215244;
-0.001807699, 0.0035084449, -0.24180284, 0.39423448;
-1.1765103e-05, 2.9554356e-05, -0.00013438619, 0.00025332544]
Furthermore, I computed 3D points using icvReconstructPointsFor3View. The six 3D points are as following:
4D points =
[-0.4999997, -0.26867214, -1, 2.88633e-07, 1.7766099e-07, -1.1447386e-07;
-0.49999994, -0.28693244, 3.2249036e-06, 1, 7.5971762e-08, 2.1956141e-07;
-0.50000024, -0.72402155, 1.6873783e-07, -6.8603946e-08, -1, 5.8393886e-07;
-0.50000012, -0.56681377, 1.202426e-07, -4.1603233e-08, -2.3659911e-07, 1]
While, actual 3D points are as following:
- { ID:1,X:500.000000, Y:800.000000, Z:3000.000000}
- { ID:2,X:500.000000, Y:800.000000, Z:4000.000000}
- { ID:3,X:1500.000000, Y:800.000000, Z:4000.000000}
- { ID:4,X:1500.000000, Y:800.000000, Z:3000.000000}
- { ID:5,X:500.000000, Y:1800.000000, Z:3000.000000}
- { ID:6,X:500.000000, Y:1800.000000, Z:4000.000000}
My question is now, how to transform P1, P2 and P3 to a form that allows
a meaningful triangulation? I need to compute the correct 3D points using trifocal tensor.
The trifocal tensor won't help you, because like the fundamental matrix, it only enables projective reconstruction of the scene and camera poses. If X0_j and P0_i are the true 3D points and camera matrices, this means that the reconstructed points Xp_j = inv(H).X0_j and camera matrices Pp_i = P0_i.H are only defined up to a common 4x4 matrix H, which is unknown.
In order to obtain a metric reconstruction, you need to know the calibration matrices of your cameras. Whether you know these matrices (e.g. if you use virtual cameras for image rendering) or you estimated them using camera calibration (see OpenCV calibration tutorials), you can find a method to obtain a metric reconstruction in §7.4.5 of "Geometry, constraints and computation of the trifocal tensor", by C.Ressl (PDF).
Note that even when using this method, you cannot obtain an up-to-scale 3D reconstruction, unless you have some additional knowledge (such as knowledge of the actual distance between two fixed 3D points).
Sketch of the algorithm:
Inputs: the three camera matrices P1, P2, P3 (projective world coordinates, with the coordinate system chosen so that P1=[I|0]), the associated calibration matrices K1, K2, K3 and one point correspondence x1, x2, x3.
Outputs: the three camera matrices P1_E, P2_E, P3_E (metric reconstruction).
Set P1_E=K1.[I|0]
Compute the fundamental matrices F21, F31. Denoting P2=[A|a] and P3=[B|b], you have F21=[a]x.A and F31=[b]x.B (see table 9.1 in [HZ00]), where for a 3x1 vector e [e]x = [0,-e_3,e_2;e_3,0,-e_1;-e_2,e_1,0]
Compute the essential matrices E21 = K2'.F21.K1 and E31 = K3'.F31.K1
For i = 2,3, do the following
i. Compute the SVD Ei1=U.S.V'. If det(U)<0 set U=-U. If det(V)<0 set V=-V.
ii. Define W=[0,-1,0;1,0,0;0,0,1], Ri=U.W.V' and ti = third column of U
iii. Define M=[Ri'.ti]x, X1=M.inv(K1).x1 and Xi=M.Ri'.inv(Ki).xi
iv. If X1_3.Xi_3<0, set Ri=U.W'.V' and recompute M and X1
v. If X1_3<0 set ti = -ti
vi. Define Pi_E=Ki.[Ri|ti]
Do the following to retrieve the correct scale for t3 (consistantly to the fact that ||t2||=1):
i. Define p2=R2'.inv(K2).x2 and p3=R3'.inv(K3).x3
ii. Define M=[p2]x
iii. Compute the scale s=(p3'.M.R2'.t2)/(p3'.M.R3'.t3)
iv. Set t3=t3*s
End of the algorithm: the camera matrices P1_E, P2_E, P3_E are valid up to an isotropic scaling of the scene and a change of 3D coordinate system (hence it is a metric reconstruction).
[HZ00] "Multiple view geometry in computer vision" , by R.Hartley and A.Zisserman, 2000.
I am doing camera calibration from tsai algo. I got intrensic and extrinsic matrix, but how can I reconstruct the 3D coordinates from that inormation?
1) I can use Gaussian Elimination for find X,Y,Z,W and then points will be X/W , Y/W , Z/W as homogeneous system.
2) I can use the
OpenCV documentation approach:
as I know u, v, R , t , I can compute X,Y,Z.
However both methods end up in different results that are not correct.
What am I'm doing wrong?
If you got extrinsic parameters then you got everything. That means that you can have Homography from the extrinsics (also called CameraPose). Pose is a 3x4 matrix, homography is a 3x3 matrix, H defined as
H = K*[r1, r2, t], //eqn 8.1, Hartley and Zisserman
with K being the camera intrinsic matrix, r1 and r2 being the first two columns of the rotation matrix, R; t is the translation vector.
Then normalize dividing everything by t3.
What happens to column r3, don't we use it? No, because it is redundant as it is the cross-product of the 2 first columns of pose.
Now that you have homography, project the points. Your 2d points are x,y. Add them a z=1, so they are now 3d. Project them as follows:
p = [x y 1];
projection = H * p; //project
projnorm = projection / p(z); //normalize
Hope this helps.
As nicely stated in the comments above, projecting 2D image coordinates into 3D "camera space" inherently requires making up the z coordinates, as this information is totally lost in the image. One solution is to assign a dummy value (z = 1) to each of the 2D image space points before projection as answered by Jav_Rock.
p = [x y 1];
projection = H * p; //project
projnorm = projection / p(z); //normalize
One interesting alternative to this dummy solution is to train a model to predict the depth of each point prior to reprojection into 3D camera-space. I tried this method and had a high degree of success using a Pytorch CNN trained on 3D bounding boxes from the KITTI dataset. Would be happy to provide code but it'd be a bit lengthy for posting here.