How to calculate Rotation and Translation matrices from homography? - computer-vision

I have already compared two images of the same scene, taken by one camera from different view angles (say left and right), using SURF in Emgu CV (C#). That gave me a 3x3 homography matrix for the 2D transformation. But now I want to place those two images in a 3D environment (using DirectX). To do that I need to calculate the relative location and orientation of the 2nd image (right) with respect to the 1st image (left) in 3D. How can I calculate the Rotation and Translation matrices for the 2nd image?
I also need the z value for the 2nd image.
I read about something called 'homography decomposition'. Is that the way?
Is there anybody familiar with homography decomposition, and is there an algorithm that implements it?
Thanks in advance for any help.

Homography only works for planar scenes (i.e. all of your points are coplanar). If that is the case, then the homography is a projective transformation and it can be decomposed into its components.
But if your scene isn't coplanar (which I think is the case from your description), then it's going to take a bit more work. Instead of a homography you need to calculate the fundamental matrix (which emgucv will do for you). The fundamental matrix is a combination of the camera intrinsic matrix (K), the relative rotation (R) and the translation (t) between the two views. Recovering the rotation and translation is pretty straightforward if you know K. It looks like emgucv has methods for camera calibration. I am not familiar with their particular method, but these generally involve taking several images of a scene with known geometry; a sketch is shown below.
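For reference, here is a minimal calibration sketch in OpenCV C++ (Emgu CV wraps the same calls). The board size, square size and file names are assumptions made purely for illustration:

#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

// Sketch: estimate the intrinsic matrix K from several images of a
// checkerboard with known geometry (assumed 9x6 inner corners, 25 mm squares).
int main() {
    cv::Size boardSize(9, 6);
    float squareSize = 25.0f;                       // mm, assumed

    std::vector<cv::Point3f> board;                 // 3D corners on the board plane (Z = 0)
    for (int y = 0; y < boardSize.height; ++y)
        for (int x = 0; x < boardSize.width; ++x)
            board.emplace_back(x * squareSize, y * squareSize, 0.0f);

    std::vector<std::vector<cv::Point3f>> objectPoints;
    std::vector<std::vector<cv::Point2f>> imagePoints;
    cv::Size imageSize;

    for (int i = 1; i <= 10; ++i) {                 // hypothetical file names
        cv::Mat img = cv::imread("calib_" + std::to_string(i) + ".png", cv::IMREAD_GRAYSCALE);
        if (img.empty()) continue;
        imageSize = img.size();
        std::vector<cv::Point2f> corners;
        if (cv::findChessboardCorners(img, boardSize, corners)) {
            imagePoints.push_back(corners);
            objectPoints.push_back(board);
        }
    }

    cv::Mat K, distCoeffs;
    std::vector<cv::Mat> rvecs, tvecs;
    double rms = cv::calibrateCamera(objectPoints, imagePoints, imageSize,
                                     K, distCoeffs, rvecs, tvecs);
    std::cout << "RMS reprojection error: " << rms << "\nK =\n" << K << std::endl;
    return 0;
}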

To figure out the camera motion (exact rotation, and translation up to a scaling factor) you need to:
Calculate the fundamental matrix F, for example, using the eight-point algorithm.
Calculate the essential matrix E = A^T * F * A, where A is the intrinsic camera matrix.
Decompose E, which by definition is Tx * R, via SVD into E = U L V^T.
Create a special 3x3 matrix

W = [ 0 -1  0
      1  0  0
      0  0  1 ]

that helps to run the decomposition:

R = U W^-1 V^T,  Tx = U L W U^T,

where Tx is the skew-symmetric matrix of the translation:

Tx = [  0  -tz   ty
        tz   0  -tx
       -ty   tx   0 ]
Since E can have an arbitrary sign and W can be replaced by W^-1, there are 4 distinct solutions, and you have to select the one that produces the most points in front of the camera.
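A hedged sketch of that pipeline in OpenCV C++ (the matched point lists and K are assumed to exist already, and cv::recoverPose performs the four-way cheirality test for you):

#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

// Sketch: recover R and t (up to scale) from matched points, following the
// steps above. pts1/pts2 and the intrinsic matrix K (CV_64F) are assumed to
// come from the feature matching and calibration steps.
void recoverMotion(const std::vector<cv::Point2f>& pts1,
                   const std::vector<cv::Point2f>& pts2,
                   const cv::Mat& K)
{
    // 1. Fundamental matrix (8-point inside a RANSAC loop).
    cv::Mat F = cv::findFundamentalMat(pts1, pts2, cv::FM_RANSAC, 1.0, 0.99);

    // 2. Essential matrix E = K^T * F * K.
    cv::Mat E = K.t() * F * K;

    // 3+4. Decompose E. recoverPose() tries the four (R, t) candidates and
    //      keeps the one that puts the most triangulated points in front of
    //      both cameras (the cheirality check described above).
    cv::Mat R, t;
    cv::recoverPose(E, pts1, pts2, K, R, t);

    std::cout << "R =\n" << R << "\nt =\n" << t << std::endl;
}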

It's been a while since you asked this question. By now, there are some good references on this problem.
One of them is "An Invitation to 3-D Vision" by Ma et al.; chapter 5 of it is free here: http://vision.ucla.edu//MASKS/chapters.html
Also, Peter Corke's Machine Vision Toolbox includes the tools to perform this. However, he does not explain much of the math behind the decomposition.

Related

Obtaining world coordinates of an object from its image coordinates

I have been following this documentation to use OpenCV. In the formula below, I have successfully calculated both the intrinsic and the extrinsic matrices (I made use of the solvePnP() procedure to obtain them). Since the object is lying on the ground, I substituted Z = 0. Then I removed the third column of the extrinsic matrix and multiplied it with the intrinsic matrix to obtain a 3x3 projection matrix. I took its inverse and multiplied it by the image coordinates, i.e. su, sv and s.
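For concreteness, the described procedure corresponds to something like the following sketch (the function and variable names are hypothetical; K, rvec and tvec are assumed to come from the calibration / solvePnP step, all CV_64F):

#include <opencv2/opencv.hpp>

// Sketch: with Z = 0, drop the third column of [R|t], form the 3x3
// ground-plane homography K*[r1 r2 t], and invert it to map a pixel (u, v)
// back to world coordinates (X, Y) on the plane.
cv::Point2d pixelToGround(const cv::Point2d& px,
                          const cv::Mat& K,
                          const cv::Mat& rvec,
                          const cv::Mat& tvec)
{
    cv::Mat R;
    cv::Rodrigues(rvec, R);                       // 3x3 rotation from rvec

    cv::Mat H(3, 3, CV_64F);                      // will hold [r1 r2 t]
    R.col(0).copyTo(H.col(0));
    R.col(1).copyTo(H.col(1));
    tvec.reshape(1, 3).copyTo(H.col(2));
    H = K * H;                                    // K * [r1 r2 t]

    cv::Mat p = (cv::Mat_<double>(3, 1) << px.x, px.y, 1.0);
    cv::Mat w = H.inv() * p;                      // homogeneous world point
    return cv::Point2d(w.at<double>(0) / w.at<double>(2),
                       w.at<double>(1) / w.at<double>(2));
}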
However, all points in the world coordinates seem to be off by 1 mm or less, so I am not getting accurate coordinates. Does anyone know where I might be going wrong?
Thanks
The camera calibration will probably always be somewhat inaccurate, because with more than 2 calibration images, instead of getting the one true solution to the equation system acquired from the calibration images, you get the solution with the smallest error.
The same goes for cv::solvePnP(). You use one of three methods of optimising over the many possible solutions for the given equation system.
I do not understand how you got the intrinsic and extrinsic matrices from cv::solvePnP(), which is used to calculate the rotation and translation of the object in the camera coordinate system.
What you can do:
Try to get better intrinsic parameters.
Try other methods for solvePnP like EPNP, or check the RANSAC version (a sketch is shown below).
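Not a definitive implementation, just a sketch of those two options with OpenCV 3.x signatures (the correspondences, K and distCoeffs are assumed to exist; the thresholds are placeholders):

#include <opencv2/opencv.hpp>
#include <vector>

// Sketch: EPnP and the RANSAC variant of solvePnP.
void estimatePose(const std::vector<cv::Point3f>& objectPoints,
                  const std::vector<cv::Point2f>& imagePoints,
                  const cv::Mat& K, const cv::Mat& distCoeffs)
{
    cv::Mat rvec, tvec;

    // EPnP instead of the default iterative solver.
    cv::solvePnP(objectPoints, imagePoints, K, distCoeffs,
                 rvec, tvec, false, cv::SOLVEPNP_EPNP);

    // RANSAC version: robust to outlier correspondences.
    std::vector<int> inliers;
    cv::solvePnPRansac(objectPoints, imagePoints, K, distCoeffs,
                       rvec, tvec, false, 100 /*iterations*/,
                       8.0 /*reprojection error, px*/, 0.99, inliers);
}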

camera extrinsic calibration

I have a fisheye camera, which I have already calibrated. I need to calculate the camera pose w.r.t. a checkerboard just by using a single image of said checkerboard, the intrinsic parameters, and the size of the squares of the checkerboard. Unfortunately, many calibration libraries first calculate the extrinsic parameters from a set of images and then the intrinsic parameters, which is essentially the "inverse" of the procedure I want. Of course I can just put my checkerboard image into the set of other images I used for the calibration and run the calibration procedure again, but it's very tedious, and moreover, I can't use a checkerboard of a different size from the ones used for the intrinsic calibration. Can anybody point me in the right direction?
EDIT: After reading francesco's answer, I realized that I didn't explain what I mean by calibrating the camera. My problem begins with the fact that I don't have the classic intrinsic parameter matrix (so I can't actually use the method Francesco described). In fact, I calibrated the fisheye camera with Scaramuzza's procedure (https://sites.google.com/site/scarabotix/ocamcalib-toolbox), which basically finds a polynomial that maps 3D world points into pixel coordinates (or, alternatively, the polynomial that back-projects pixels onto the unit sphere). Now, I think this information is enough to find the camera pose w.r.t. a chessboard, but I'm not sure exactly how to proceed.
The solvePnP procedure calculates the extrinsic pose of the chessboard (CB) in camera coordinates. OpenCV added a fisheye library to its 3D reconstruction module to accommodate the significant distortions of cameras with a large field of view. Of course, if your intrinsic matrix or transformation is not a classical intrinsic matrix, you have to modify PnP:
Undo whatever back-projection you did. Now you have a so-called normalized camera, where the effect of the intrinsic matrix has been eliminated:
k * [u, v, 1]^T = [R|T] * [x, y, z, 1]^T
The way to solve this is to write the expression for k first:
k = R20*x + R21*y + R22*z + Tz
then use the above expression in
k*u = R00*x + R01*y + R02*z + Tx
k*v = R10*x + R11*y + R12*z + Ty
You can rearrange the terms to get A*x = 0, subject to |x| = 1, where the unknown is
x = [R00, R01, R02, Tx, R10, R11, R12, Ty, R20, R21, R22, Tz]^T
and the rows of A are composed of the known u, v, x, y, z - the pixel and CB corner coordinates.
Then you solve for x = the last column of V, where A = U L V^T, and assemble the rotation and translation matrices from x. Then there are a few 'messy' steps that are actually very typical for this kind of processing:
A. Ensure that you get a real rotation matrix - perform an orthogonal Procrustes step: R2 = U V^T, where the SVD of your estimated R is R = U L V^T.
B. Calculate the scale factor scl = sum(R2(i,j) / R(i,j)) / 9.
C. Update the translation vector T2 = scl * T and check that Tz > 0; if it is negative, invert T and negate R.
Now R2, T2 give you a good starting point for a non-linear optimization algorithm such as Levenberg-Marquardt. It is required because the previous linear step optimizes only an algebraic error in the parameters, while the non-linear one optimizes a correct metric such as the squared error in pixel distances. However, if you don't want to follow all these steps, you can take advantage of OpenCV's fisheye library.
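A rough sketch of the linear step above (a hand-rolled DLT for a normalized camera, not OpenCV's PnP). The normalized image coordinates and the board corners are assumed to be available, with at least six correspondences:

#include <opencv2/opencv.hpp>
#include <vector>

// Stack two equations per correspondence into A*x = 0 with
// x = [R00 R01 R02 Tx R10 R11 R12 Ty R20 R21 R22 Tz]^T and take the right
// singular vector of A with the smallest singular value.
// (u, v) are normalized image coordinates, (X, Y, Z) are board corners.
cv::Mat linearPnP(const std::vector<cv::Point2f>& uv,
                  const std::vector<cv::Point3f>& xyz)
{
    const int n = static_cast<int>(uv.size());
    cv::Mat A = cv::Mat::zeros(2 * n, 12, CV_64F);

    for (int i = 0; i < n; ++i) {
        const double u = uv[i].x, v = uv[i].y;
        const double X = xyz[i].x, Y = xyz[i].y, Z = xyz[i].z;
        double* r1 = A.ptr<double>(2 * i);
        double* r2 = A.ptr<double>(2 * i + 1);
        // k*u = R00*X + R01*Y + R02*Z + Tx, with k = R20*X + R21*Y + R22*Z + Tz
        r1[0] = X; r1[1] = Y; r1[2] = Z; r1[3] = 1.0;
        r1[8] = -u * X; r1[9] = -u * Y; r1[10] = -u * Z; r1[11] = -u;
        // k*v = R10*X + R11*Y + R12*Z + Ty
        r2[4] = X; r2[5] = Y; r2[6] = Z; r2[7] = 1.0;
        r2[8] = -v * X; r2[9] = -v * Y; r2[10] = -v * Z; r2[11] = -v;
    }

    cv::SVD svd(A, cv::SVD::FULL_UV);
    cv::Mat x = svd.vt.row(11).t();           // last row of V^T = last column of V
    return x.reshape(1, 3);                   // 3x4 [R|T], still up to scale and sign
}

The Procrustes, scale and sign steps (A, B, C above) would then be applied to the returned 3x4 matrix.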
I assume that by "calibrated" you mean that you have a pinhole model for your camera.
Then the transformation between your chessboard plane and the image plane is a homography, which you can estimate from the image of the corners using the usual DLT algorithm. You can then express it as the product, up to scale, of the matrix of intrinsic parameters A and [x|y|t], where the x and y columns are the x and y unit vectors of the world's (i.e. the chessboard's) coordinate frame, and t is the vector from the camera centre to the origin of that same frame. That is:
H = scale * A * [x|y|t]
Therefore
[x|y|t] = 1/scale * inv(A) * H
The scale is chosen so that x and y have unit length. Once you have x and y, the third axis is just their cross product.
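A sketch of that recipe (assuming H comes from cv::findHomography on the board corners and A is the 3x3 pinhole intrinsic matrix, both CV_64F):

#include <opencv2/opencv.hpp>

// Returns the 3x4 pose [x | y | x×y | t] of the board in camera coordinates.
cv::Mat poseFromHomography(const cv::Mat& H, const cv::Mat& A)
{
    cv::Mat M = A.inv() * H;                   // = 1/scale * [x | y | t]

    // Fix the scale so that x and y have unit length (use their average norm).
    double scale = 2.0 / (cv::norm(M.col(0)) + cv::norm(M.col(1)));
    if (M.at<double>(2, 2) < 0) scale = -scale; // keep the board in front of the camera
    M *= scale;

    cv::Mat x = M.col(0), y = M.col(1), t = M.col(2);
    cv::Mat z = x.cross(y);                    // third axis = x × y

    cv::Mat pose(3, 4, CV_64F);
    x.copyTo(pose.col(0));
    y.copyTo(pose.col(1));
    z.copyTo(pose.col(2));
    t.copyTo(pose.col(3));
    return pose;
}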

projection matrix from homography

I'm working on stereo-vision with the stereoRectifyUncalibrated() method under OpenCV 3.0.
I calibrate my system with the following steps:
Detect and match SURF feature points between images from 2 cameras
Apply findFundamentalMat() with the matching pairs
Get the rectifying homographies with stereoRectifyUncalibrated().
For each camera, I compute a rotation matrix as follows:
R1 = cameraMatrix[0].inv()*H1*cameraMatrix[0];
To compute 3D points, I need to get the projection matrix, but I don't know how I can estimate the translation vector.
I tried decomposeHomographyMat() and this solution https://stackoverflow.com/a/10781165/3653104, but the rotation matrix is not the same as what I get with R1.
When I check the rectified images with R1/R2 (using initUndistortRectifyMap() followed by remap()), the result seems correct (I checked with epipolar lines).
I am a little lost with my weak knowledge of vision, so I would appreciate it if somebody could explain this to me. Thank you :)
The code in the link that you have provided (https://stackoverflow.com/a/10781165/3653104) computes not the rotation but the 3x4 pose of the camera.
The last column of the pose is your translation vector.
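If the calibration matrix is available, a hedged alternative sketch (not the uncalibrated-rectification path) is to recover R and t directly from the matches and assemble projection matrices from them. The matched point lists and K (CV_64F) are assumed to exist, and t is only defined up to scale:

#include <opencv2/opencv.hpp>
#include <vector>

void buildProjectionMatrices(const std::vector<cv::Point2f>& pts1,
                             const std::vector<cv::Point2f>& pts2,
                             const cv::Mat& K)
{
    cv::Mat E = cv::findEssentialMat(pts1, pts2, K, cv::RANSAC, 0.999, 1.0);
    cv::Mat R, t;
    cv::recoverPose(E, pts1, pts2, K, R, t);

    // P1 = K [I|0], P2 = K [R|t]  (the last column of the pose is t).
    cv::Mat P1 = K * cv::Mat::eye(3, 4, CV_64F);
    cv::Mat Rt;
    cv::hconcat(R, t, Rt);
    cv::Mat P2 = K * Rt;

    cv::Mat points4D;
    cv::triangulatePoints(P1, P2, pts1, pts2, points4D);  // homogeneous 3D points
}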

Compensate rotation from successive images

Let's say I have image1 and image2 obtained from a webcam. For taking image2, the webcam undergoes a rotation (yaw, pitch, roll) and a translation.
What I want: remove the rotation from image2 so that only the translation remains (to be precise: my tracked points (x, y) from image2 should be rotated to the same values as in image1 so that only the translation component remains).
What I have done/tried so far:
I tracked corresponding features from image1 and image2.
Calculated the fundamental matrix F with RANSAC to remove outliers.
Calibrated the camera so that I got a CAM_MATRIX (fx, fy and so on).
Calculated the essential matrix from F with CAM_MATRIX (E = cam_matrix^T * F * cam_matrix)
Decomposed the E matrix with OpenCV's SVD function so that I have a rotation matrix and a translation vector.
I know that there are 4 combinations and only 1 is the right translation vector / rotation matrix.
So my thought was: I know that the camera movement from image1 to image2 won't be more than, let's say, about 20° per axis, so I can eliminate at least 2 possibilities where the angles are too far off.
For the 2 remaining ones I have to triangulate the points and see which one is correct (I have read that I only need 1 point, but due to possible errors/outliers it should be done with a few more to be sure which one is right). I think I could use OpenCV's triangulation function for this. Is my thinking right so far? Do I need to calculate the projection error?
Let's move on and assume that I finally obtained the right R|t matrix.
How do I continue? I tried to multiply both the normal and the transposed rotation matrix, which should reverse the rotation (?), with a tracked point in image2 (for testing purposes I just tried both possible combinations of R|t; I have not done the triangulation in code yet), but the calculated point is way too far off from what it should be. Do I need the calibration matrix here as well?
So how can I invert the rotation applied to image2? (To be exact, apply the inverse rotation to my std::vector<cv::Point2f> array, which contains the tracked (x, y) points from image2.)
Displaying the de-rotated image would also be nice to have. Is this done with the warpPerspective function, like in this post?
(I just don't fully understand what the purpose of A1/A2 and dist in the T matrix is, or how I can adapt that solution to my problem.)
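Not a full answer, but a sketch of the de-rotation step, under the assumption that R from the decomposition maps camera-1 coordinates to camera-2 coordinates (if the convention is the other way round, drop the transpose). The calibration matrix is indeed needed here; K and R are assumed to be CV_64F:

#include <opencv2/opencv.hpp>
#include <vector>

// For a pure rotation R between the views, pixels map through the homography
// H = K * R^T * K^-1. Applying H to the tracked points of image2 removes the
// rotation; image2 itself can be warped the same way for display.
void removeRotation(const std::vector<cv::Point2f>& ptsImg2,
                    const cv::Mat& K, const cv::Mat& R,
                    const cv::Mat& img2)
{
    cv::Mat H = K * R.t() * K.inv();

    // De-rotate the tracked points.
    std::vector<cv::Point2f> derotated;
    cv::perspectiveTransform(ptsImg2, derotated, H);

    // Optional: view the de-rotated image.
    cv::Mat warped;
    cv::warpPerspective(img2, warped, H, img2.size());
}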

Rotation Matrices of Coordinate Systems (Euler Angles, Handedness)

I have a coordinate system in which the orientation of a camera is represented as R = Rz(k) * Ry(p) * Rx(o), where R is a 3x3 matrix formed as the composition of 3x3 rotation matrices around each of the X, Y and Z axes. Moreover, I have a convention in which the Z-axis is the viewing direction of the camera, the X-axis is left-right and the Y-axis is bottom-up.
This R matrix is used in a multi-view stereo reconstruction algorithm. I have a test data set which comes with pre-calibrated camera information. I want to use the R matrices that come with this data set. However, I have no idea what rotation order they assume, or even their handedness.
How would I be able to figure this out? Any ideas?
R=Rz(k) * Ry(p) * Rx(o)
This is a very unstable way of doing it. Euler angles are prone to gimbal lock, so I strongly advise against their use.
How would I be able to figure this out?
Well, this problem is difficult to express in closed form. Your best bet is to treat it as an optimization problem in 3-space, where you try to find the values of k, p and o that match the given rotation matrix R. There are several possible permutations of the evaluation order, so you do that fit for each of them and take the best-matching result; a sketch of the rebuild-and-compare check is shown below.
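A hedged sketch of that check for one ordering (Rz*Ry*Rx): extract the angles assuming that order, rebuild R, and measure the residual. The other orderings, and the transposed matrix (to test the opposite convention), can be checked the same way; the ordering with a residual near zero is the one the data set uses.

#include <opencv2/opencv.hpp>
#include <cmath>

static cv::Mat rotX(double a) {
    return (cv::Mat_<double>(3,3) << 1,0,0,  0,std::cos(a),-std::sin(a),  0,std::sin(a),std::cos(a));
}
static cv::Mat rotY(double a) {
    return (cv::Mat_<double>(3,3) << std::cos(a),0,std::sin(a),  0,1,0,  -std::sin(a),0,std::cos(a));
}
static cv::Mat rotZ(double a) {
    return (cv::Mat_<double>(3,3) << std::cos(a),-std::sin(a),0,  std::sin(a),std::cos(a),0,  0,0,1);
}

// Residual of the hypothesis "R was built as Rz(k)*Ry(p)*Rx(o)".
double residualZYX(const cv::Mat& R)           // R assumed 3x3 CV_64F
{
    // Closed-form extraction for the Z*Y*X order (away from gimbal lock).
    double p = std::asin(-R.at<double>(2, 0));
    double o = std::atan2(R.at<double>(2, 1), R.at<double>(2, 2));
    double k = std::atan2(R.at<double>(1, 0), R.at<double>(0, 0));
    cv::Mat rebuilt = rotZ(k) * rotY(p) * rotX(o);
    return cv::norm(R - rebuilt);              // ~0 if the assumed order is right
}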